1 Introduction


A synopsis of this research program is provided below. General motivation for
the use of learning in control is provided as background material in Section 1.1.
An overview of this report follows in Section 1.2. A list of technical publications
produced during the course of this work is provided in Section 1.3.

This report describes results obtained during a multiphase research program
having the broad aim of investigating the application of learning systems to
automatic control in general, and to flight control in particular. The first phase
analyzed the original drive-reinforcement learning paradigm [Klopf (1988)] and
examined its application to automatic control, with mixed results. The second
phase compared a number of alternative control strategies including conventional
linear control [Friedland (1986)], adaptive control [Astrom & Wittenmark (1989)],
and other reinforcement learning control methods [Barto, Sutton, & Anderson
(1983)], and resulted in the conception of a new hybrid adaptive/learning control
scheme [Baker & Farrell (1990)]. Subsequently, in the third phase, this hybrid
control approach was more fully developed and applied to several nonlinear
dynamical systems, including a cart-pole system, an aeroelastic oscillator, and a
three-degree-of-freedom high performance aircraft. The fourth phase revisited
drive-reinforcement learning from the point of view of optimal control and
successfully applied a version embedded in the associative control process
architecture [Klopf, Morgan, & Weaver (1992)] to regulate an aeroelastic oscillator.
The fifth phase examined the problem of learning augmented estimation, and
resulted in the development of a preliminary estimation scheme consistent with
the hybrid adaptive/learning control approach. In the sixth and final phase, the
hybrid control methodology was applied to a nonlinear, six-degree-of-freedom
flight control problem, and then successfully demonstrated via a challenging
multiaxis maneuver.

Initial work with the basic drive-reinforcement (D-R) learning algorithm showed
considerable promise for its application to automatic control. However, it was
soon demonstrated that without added functionality the basic algorithm could not
serve alone as a learning controller, at least not in the usual sense of what is
meant by learning control. Moreover, an examination of a number of alternative
strategies (including other reinforcement learning strategies) revealed many
candidates with both advantages and disadvantages, but none with a clear
dominance over the others (particularly in the context of flight control). During
this period, a novel hybrid adaptive/learning control scheme [Baird & Baker (1990);
Baker & Farrell (1990)] was conceived that provided many of the advantages seen
among the candidates considered, yet that avoided many of their disadvantages.
In light of this, a decision was made to pursue this new approach in lieu of
others. At the same time, emphasis was placed on the use of learning to address
problems in control related to uncertainty and nonlinearity, rather than to
problems related to optimization and implicit behavioral objectives.

Accordingly, development and refinement of the hybrid adaptive/learning control
methodology continued during a substantial portion of the program. This
approach was successfully applied to a number of nonlinear dynamical systems,
culminating in its application to a multiaxis flight control problem. The learning
augmented flight control system was constructed by augmenting a simple linear
compensator design with both an adaptive and a learning capability. The
model-based linear compensator was designed following a procedure similar to that
described in [Anderson & Schmidt (1991)]. The adaptive compensator was developed
by incorporating and extending ideas presented in [Youcef-Toumi & Ito (1990)].
Finally, a hybrid adaptive/learning control system was developed by combining
the same adaptive compensator with a spatially localized learning system based
on a linear-Gaussian network [Baker & Farrell (1990); Millington (1991)]. The
hybrid adaptive/learning flight control system was successfully demonstrated on the
6-DOF nonlinear aircraft model via a challenging multiaxis maneuver. To
illustrate the benefit of learning augmentation, the basic linear and adaptively
augmented compensator designs were used as baseline controllers.

In the latter part of the program, a decision was made to revisit, from the point of
view of optimal control, the D-R learning paradigm and, more generally, the new
associative control process (ACP) architecture [Klopf, Morgan, & Weaver (1992)] in
which it was embedded. It was found that the ACP architecture provided the
additional functionality needed by the original D-R algorithm to allow it to be used
for optimal control. Although it was too late in the program timeline to consider
its application to flight control, an ACP-based controller was developed and
successfully applied to the problem of regulating the output of a nonlinear aeroelastic
oscillator model in an optimal fashion.

Throughout this research program, software development was substantially
facilitated through the use of a custom simulation environment known as NetSim
[Alexander, et al. (1991)] that was designed especially for the investigation of
connectionist network based learning systems, and also by the existence of a
representative high performance nonlinear aircraft model in FORTRAN [Brumbaugh
(1990)]. New software development was essentially limited to the creation of
NetSim modules for various example applications and to the conversion of the
FORTRAN aircraft model into a C-based NetSim module. In addition, further
development and refinement of these modules and of the NetSim application was
performed.

The aircraft code used in this work was derived from a six-degree-of-freedom
(6-DOF) high performance aircraft model, incorporating nonlinear aerodynamic
effects (based on empirically derived tabular data), nonlinear engine dynamics, and
nonlinear actuator dynamics (including rate and position limits). This code is a
slightly modified version of an F-15 simulation developed by NASA/Dryden. A
more detailed description of the basic aircraft model and its FORTRAN
implementation can be found in [Brumbaugh (1990)].

Results obtained during this research program clearly demonstrate many of the
potential benefits of learning augmented control and especially the advantages
that may be gained in terms of design facilitation, automatic accommodation of
uncertainty, on-line performance optimization, and operational efficiency. The
bottom line is that learning augmentation is beneficial to automatic control in
general, and to flight control in particular.^ Although significant progress was made
during this research program, these results also serve to indicate that further
work is needed. Topics for future research and development include:

•  further development of the hybrid control and estimation methodology

•  development and refinement of variable structure learning systems

•  further investigation of reinforcement learning in the context of optimal
control and multiplayer game problems

•  research and development of continuous input/output reinforcement
learning systems


^ This claim is further supported by a second research program funded by the Navy (under USN
Contract No. N62269-91-C-0033) which involved the application of the hybrid adaptive/
learning control technique conceived and developed in this program to a full subsonic
envelope, handling qualities improvement system for a high performance aircraft
[Millington & Baker (1992)].

1.1 Background: Motivation for Learning Control

Advanced control systems for autonomous or highly automated systems are
expected to maintain closed-loop system stability and performance over a wide range
of operating conditions and events. This objective can be difficult to achieve due to
the complexity of both the plant (i.e., the system to be controlled) and the
performance objectives, and due to the presence of uncertainty. Such complications
may result from nonlinear or time-varying behavior, poorly modeled plant
dynamics, high dimensionality, multiple inputs and outputs, complex objective
functions, operational constraints, imperfect measurements, and the possibility
of actuator, sensor, or other component failures. Each of these effects, if present,
must be addressed if the system is to operate reliably in an automatic fashion. A
view strongly advocated here is that learning control systems may be used
advantageously to address several of these difficulties. In particular, in this research
program we have focused on the control of complex dynamical systems that are
poorly modeled and nonlinear.

Unfortunately, it is difficult to provide a precise and completely satisfactory
definition for the term "learning control system." One interpretation that is, however,
consistent with the prevailing literature (e.g., [Klopf & Morgan (1990)]) is that:

A learning control system is one that has the ability to improve its
performance in the future, based on experiential information it has
gained in the past, through closed-loop interactions with the plant
and its environment.^


^ To help focus the discussion that follows and avoid any unnecessary controversy, we will
further limit our subject to include, primarily, the type of learning that one might associate
with sensorimotor control, and exclude more sophisticated learning behaviors (e.g., planning
and exploration).
There are several implications of this statement. One implication is that a
learning control system has some autonomous capability, since it has the ability to
improve its own performance. Another is that it is dynamic, since it may vary over
time. Yet another implication is that it has memory, since it can exploit past
experience to improve future performance. Finally, to improve its performance, the
learning system must operate in the context of an objective function and,
moreover, it must receive performance feedback that characterizes the appropriateness
of its current behavior in that context.

In a fundamental sense, the control design problem is to find an appropriate
mapping, from measured plant outputs y_m and desired plant outputs y_d, to a
control action u that will produce satisfactory behavior in the closed-loop system.
In other words, the problem is to choose a function (a control law) u = k(y_m, y_d, t)
that achieves certain performance objectives when applied to the open-loop
system. In turn, the solution to this problem may naturally involve other mappings;
e.g., a mapping from the current plant operating condition to the parameters of a
controller or local plant model, or a mapping from measured plant outputs to
estimated plant state. Accordingly, a learning system that could be used to
synthesize such mappings on-line would be an advantageous component of an advanced
control system. To successfully employ learning systems in this manner, one
must have an effective means for their implementation and incorporation into the
overall control system architecture. The belief that connectionist systems offer a
suitable means with which to implement learning control systems has been the
impetus for a large body of recent research^ (e.g., [Albus (1975); Anderson (1989);
Atkins (1993); Baird & Baker (1990); Baird (1991); Baker & Farrell (1990, 1991,
1992); Baker & Millington (1992, 1993); Barto, Sutton, & Anderson (1983); Berger
(1992); Cerrato (1993); Farrell & Baker (1991, 1992, forthcoming); Klopf & Morgan
(1990); Klopf, Morgan, & Weaver (1992); Millington & Baker (1992); Millington
(1991); Millington, Baker, & Koenig (1993); Morgan, Patterson, & Klopf (1990);
Nistler (1992); Steinberg (1992); Vos, Baker, & Millington (1991)]). Perhaps a more
cogent statement of affairs is that, in the context of control, learning can be viewed
as the automatic incremental synthesis of multivariable functional mappings
and, moreover, that connectionist systems provide a useful framework for
realizing such mappings.

^ This is not intended to be a comprehensive list.




The necessity for applying learning arises in situations where a system must
operate in conditions of uncertainty, and when the available a priori information is
so limited that it is impossible or impractical to design in advance a system that
has fixed properties and also performs sufficiently well [Tsypkin (1973)]. In the
context of intelligent control, learning can be viewed as a means of solving those
problems that lack sufficient a priori information to allow a complete and fixed
(i.e., nonadaptable) control system design to be derived in advance. Thus, a
central role of learning in intelligent control is to enable a wider class of problems to
be solved, by reducing the prior uncertainty to the point where satisfactory
solutions can be obtained on-line. This result is achieved empirically, by means of
performance feedback, association, and memory (or knowledge base) adjustment.

One of the principal benefits of learning control, given the present state of its
technological development, derives from the ability of learning systems to
automatically synthesize mappings that can be used advantageously within a control
system architecture. Examples of such mappings include a controller mapping
that relates measured and desired plant outputs to an appropriate set of control
actions (Fig. 1.1a), a related control parameter mapping that generates
parameters (e.g., gains) for a separate controller (Fig. 1.1b), a model state (or estimator)
mapping that produces state estimates (Fig. 1.1c), and a model parameter
mapping that relates the plant operating condition to an accurate set of model
parameters (Fig. 1.1d). In general, these mappings may represent dynamic functions
(i.e., functions that involve temporal differentiation or integration).

Learning is required when these mappings cannot be determined completely in
advance because of a priori uncertainty (e.g., modeling error). In a typical
learning control application, the desired mapping is stationary (i.e., does not depend
explicitly on time), and is expressed (implicitly) in terms of an objective function
involving the outputs of both the plant and the learning system. The objective
function is used to provide performance feedback to the learning system, which
must then associate this feedback with specific adjustable elements of the
mapping that is currently stored in its memory. The underlying idea is that
experience can be used to improve the mapping furnished by the learning system.
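As a rough illustration of how performance feedback can be associated with specific adjustable elements of a stored mapping, the sketch below trains a small network of normalized Gaussian basis functions on a one-dimensional mapping. It is a minimal sketch under stated assumptions, not the linear-Gaussian network used in this program: the class name GaussianMapping, the target function sin(x), the basis centers and width, and the learning rate are all hypothetical choices made only for the example. The feedback signal is taken to be the output error, and each weight is adjusted in proportion to how active its basis function was at the current input, so that learning remains spatially localized.

```python
import math


class GaussianMapping:
    """Spatially localized approximator: normalized Gaussian basis functions
    over a scalar input, with one adjustable weight per basis function."""

    def __init__(self, centers, width, rate):
        self.centers = list(centers)
        self.width = width
        self.rate = rate
        self.weights = [0.0] * len(self.centers)  # the stored "memory"

    def _activations(self, x):
        # Normalized activations: nonnegative and summing to one.
        raw = [math.exp(-((x - c) / self.width) ** 2) for c in self.centers]
        total = sum(raw)
        return [r / total for r in raw]

    def output(self, x):
        return sum(w * a for w, a in zip(self.weights, self._activations(x)))

    def update(self, x, error):
        # Gradient step on error**2 / 2: the feedback is assigned to each
        # weight in proportion to the activity of its basis function at x.
        for i, a in enumerate(self._activations(x)):
            self.weights[i] += self.rate * error * a


def target(x):
    # Hypothetical "unknown" mapping standing in for a plant nonlinearity.
    return math.sin(x)


SAMPLES = [i * 0.1 for i in range(-30, 31)]  # operating points in [-3, 3]


def max_error(mapping):
    return max(abs(target(x) - mapping.output(x)) for x in SAMPLES)


mapping = GaussianMapping(centers=[i * 0.5 for i in range(-6, 7)],
                          width=0.5, rate=0.5)
initial_error = max_error(mapping)

# Repeated experience over the operating range improves the stored mapping.
for _ in range(200):
    for x in SAMPLES:
        mapping.update(x, target(x) - mapping.output(x))

final_error = max_error(mapping)
```

Because each update changes mainly the weights whose basis functions are centered near the current input, experience gained in one region of the input space has little effect on what has been stored for other regions.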
Figure 1.1. Four Control System Architectures, Employing Different Mappings.


1.2 Report Overview


The remainder of this report is organized into a number of chapters (described
below) whose subjects reflect the principal tasks pursued under this research
program. In addition, three graduate student theses (also described below) are
included in their entirety as attachments. A total of 14 technical publications were
generated based on work that was performed during the course of this program
(see Section 1.3). To minimize redundancy as well as the cost of producing this
final report, extensive reference will be made to the relevant parts of these
documents and to certain articles contained in the bibliography.
1.2.1 Main Document


The main document presents an overview of this research program, including a
summary of all activities and accomplishments. The principal themes of the
remaining chapters are outlined below.

Chapter 2: Drive-Reinforcement Learning addresses those aspects of the program
that were primarily concerned with an investigation of reinforcement learning
methods and their application to problems in automatic control. The focus of this
part of the investigation was on drive-reinforcement learning [Klopf (1988)] and on
associative control process networks [Klopf, Morgan, & Weaver (1992); Baird &
Klopf (1992)]. Conventional alternatives to learning control are also reviewed. In
addition, a reinforcement learning system based on the use of the associative
search element / adaptive critic element [Barto, Sutton, & Anderson (1983)] was
examined. Experimental results were obtained by applying some of these
approaches to a cart-pole stabilization and tracking problem. The use of
reinforcement learning in the context of optimal control is also examined.

Chapter 3: Learning for Flight Control provides a high-level discussion of the
motivation for, as well as the issues underlying, the use of learning in flight control
applications. Based on this investigation, new hybrid adaptive/learning control
architectures are conceived.

Chapter 4: Hybrid Adaptive/Learning Control provides a detailed mathematical
development of a novel hybrid adaptive/learning control methodology. In
addition, a preliminary technique for learning augmented estimation which is
consistent with the hybrid control architecture is also presented.

Chapter 5: Multiaxis Flight Control addresses those aspects of the program
related to the development and demonstration of a learning augmented flight
control system for a nonlinear vehicle model representative of a modern high
performance aircraft. A challenging, multiaxis "S trajectory" maneuver is used to
highlight the benefits of learning augmentation.

Chapter 6: Conclusion provides a summary of this program and
recommendations for future research.
A bibliography of the references used in the course of this research is included at
the end of this document. Additional bibliographies are included at the end of
each attachment.


Attachment 1 is a Master's Thesis, entitled Learning and Adaptive Hybrid
Systems for Nonlinear Control [Baird (1991)], that was completed under this research
program. The object of this thesis was to find methods for combining learning
systems with adaptive systems so as to achieve good control in the presence of both
spatial and temporal functional dependencies. Several methods were developed
for augmenting the estimation carried out by an indirect adaptive system with the
additional information available from a learning system. In addition to
developing a simple form of learning augmented estimation, various issues in the
construction and use of connectionist learning systems were explored in this context.

Chapter 2: Background outlines some of the important concepts and historical
development of connectionist learning systems, control systems, and approaches for
using connectionist learning systems for control.

Chapter 3: Hybrid Control Architecture covers the adaptive controller and
connectionist networks that were integrated into a single hybrid controller. Both the
individual components and the final, integrated system are described in detail.

Chapter 4: Connectionist Learning for Control covers some of the difficulties
associated with learning systems used for control, and describes various methods
that might be used to address those difficulties.

Chapter 5: Experiments describes the various simulations performed. These
results are presented graphically and are interpreted in relation to the original
research goals.

Chapter 6: Conclusions and Recommendations summarizes what has been
accomplished, draws conclusions, and points out areas in which future research
should be focused.
Attachment 2 is a Master's Thesis, entitled A Learning Enhanced Flight Control
System for High Performance Aircraft [Nistler (1992)], that was completed under
this research program. This thesis explored the use of a learning system to
augment an adaptive flight controller. The extent to which learning can be used
to improve an adaptive flight control system architecture, as well as the
difficulties introduced by learning augmentation, were examined. The primary objective
of this thesis was to illustrate the advantages of a hybrid adaptive/learning control
system in terms of its ability to accommodate unmodeled dynamics and to reduce
state-dependent uncertainties in the system model. This hybrid approach offers
advantages over conventional techniques in terms of performance, robustness,
and design refinement costs.

Chapter 2: Background discusses some of the challenges associated with flight
control law design. Moreover, background information on traditional control
techniques is provided to serve as a foundation for the hybrid control law
development, and also as a basis for comparison of alternative designs. The theoretical
concepts underlying connectionist learning systems, as well as some approaches
to using learning systems for control, are also presented.

Chapter 3: Technical Approach considers technical aspects of the hybrid control
law. This is accomplished by first presenting the underlying theory of the
adaptive system and the spatially localized learning system before moving on to a
derivation of the hybrid system. General characteristics of the hybrid controller
are also presented.

Chapter 4: Experiments presents two examples to illustrate the implementation
and performance of the hybrid control law. The first experiment used the hybrid
system to control a relatively simple nonlinear aeroelastic oscillator. Due to the
low dimensionality of the plant, and the availability of a truth model, the analysis
and evaluation of the hybrid control system for the aeroelastic oscillator was
greatly simplified. In the second experiment, the hybrid system was applied to a
realistic high performance aircraft model. Descriptions of the major components
of the aircraft model as well as its significant control characteristics are also
provided. An evaluation of aircraft performance when controlled by the hybrid
system is presented and compared with other designs for various simulations.
Learning system characteristics are also described.

Chapter 5: Conclusions and Recommendations summarizes the major
contributions of this thesis. In addition, recommendations for future research are
presented.


Attachment 3 is a Master's Thesis, entitled Incremental Synthesis of Optimal
Control Laws Using Learning Algorithms [Atkins (1993)], that was completed
under this research program. The primary objective of this thesis was to
incrementally synthesize a nonlinear optimal control law, through real-time,
closed-loop interactions between the dynamic system, its environment, and a learning
system, when substantial initial model uncertainty exists. The dynamic system
is assumed to be nonlinear, time-invariant, and of known state dimension, but
otherwise only inaccurately described by an a priori model. The problem,
therefore, requires either explicit or implicit system identification. No disturbances,
noise, or other time-varying dynamics were assumed to exist. The optimal
control law is assumed to extremize an evaluation of the state trajectory and the
control sequence, for any initial condition.

One goal of this thesis was to present an investigation of several approaches for
incrementally synthesizing (on-line) an optimal control law. A second goal was to
propose a direct/indirect framework with which to distinguish such
architectures. This thesis unifies a variety of concepts from control theory and behavioral
science (where the learning process has been considered extensively) by
presenting two different learning algorithms applied to the same control problem: the
Associative Control Process (ACP) algorithm [Klopf, Morgan, & Weaver (1992)],
which was initially developed to predict animal learning behavior, and Q learning
[Watkins (1989)], which derives from the mathematical theory of value iteration.
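For readers unfamiliar with Q learning, the following sketch shows the standard tabular update on a toy problem. It is an illustrative sketch only and is not taken from the thesis: the five-state chain, the reward of 1 for reaching the goal state, and the values of GAMMA, ALPHA, and EPSILON are hypothetical choices made for the example, and the aeroelastic oscillator problem studied in the thesis is not reproduced here. The method is direct in the sense used above, because the learner uses only observed transitions and rewards and never consults an explicit model of the dynamics.

```python
import random

N_STATES = 5         # states 0..4; state 4 is the goal and ends an episode
ACTIONS = (-1, +1)   # move one state to the left or to the right
GAMMA = 0.9          # discount factor (assumed value)
ALPHA = 0.5          # learning rate (assumed value)
EPSILON = 0.2        # probability of a random exploratory action


def step(state, action):
    """Toy deterministic dynamics: move along the chain, clipped at the ends."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def greedy_index(q_row):
    """Index of the action with the largest estimated value (ties go left)."""
    return max((0, 1), key=lambda i: q_row[i])


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = rng.randrange(N_STATES - 1)   # random non-goal start state
        while state != N_STATES - 1:
            if rng.random() < EPSILON:
                a = rng.randrange(2)          # explore
            else:
                a = greedy_index(q[state])    # exploit current estimates
            next_state, reward = step(state, ACTIONS[a])
            # Q-learning update: move Q(s, a) toward the one-step backup
            # reward + GAMMA * max over a' of Q(s', a').
            backup = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (backup - q[state][a])
            state = next_state
    return q


q_table = q_learning()
greedy_policy = [ACTIONS[greedy_index(q_table[s])] for s in range(N_STATES - 1)]
```

After training, the greedy policy moves right from every non-goal state, which is the optimal behavior for this toy chain.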

Chapter 2: The Aeroelastic Oscillator describes a two-state physical system that
exhibits interesting nonlinear dynamics, and was used throughout the thesis to
evaluate different control algorithms that incorporate learning. The algorithms
that are explored in Chapters 3-5 do not explicitly employ dynamic models of the
system and, therefore, may be categorized as direct methods of learning an
optimal control law. In contrast, Chapter 6 develops an indirect, model-based
approach to learning an optimal control law.

Chapter 3: The Associative Control Process reviews the ACP learning paradigm
(including drive-reinforcement learning) and discusses an application of an ACP
network to an optimal control problem involving the regulation of a nonlinear
aeroelastic oscillator. Simulation results are presented. This chapter also
introduces the concept of direct learning methods in conjunction with the on-line
synthesis of an optimal control law.

Chapter 4: Policy and Value Iteration reviews a number of basic concepts
including those of policy iteration, value iteration, and Q learning. In addition,
simulation results of the application of Q learning to the aeroelastic oscillator problem
are presented.

Chapter 5: Temporal Difference Methods reviews a general theory of temporal
difference methods as developed in [Sutton (1988)]. Following this review, a
comparison of the preceding direct methods for the synthesis of optimal control laws is
presented.

Chapter 6: Indirect Learning Optimal Control introduces the notion of indirect
methods in the on-line synthesis of optimal control laws and derives several that
are optimal with respect to various finite horizon cost functionals. The structure
of the control laws with and without learning augmentation is presented for
several cost functionals, to illustrate the manner in which learning may be used.

Chapter 7: Summary reviews the major contributions of this thesis; in addition,
recommendations for future research are presented.

Appendix A: Differential Dynamic Programming briefly reviews both dynamic
programming (DP) and differential dynamic programming (DDP), which are
classical, alternative methods for synthesizing optimal controls. DDP is not
restricted to operations over a discrete input space and discrete output space. The
DP and DDP algorithms are model-based and, therefore, learning may be
introduced by explicitly improving the a priori model, resulting in an indirect learning
optimal controller. However, neither DP nor DDP is easily implemented on-line.
Additionally, DDP does not address the problem of synthesizing a control law over
the full state-space.


1.3 Program Related Technical Publications


The work underlying the following list of technical publications was performed,
in whole or in part, under this research program. The publications are grouped
according to type (i.e., graduate student thesis, conference paper, or book chapter)
and are listed in chronological order within each group. The research performed
in conjunction with the three graduate student theses was fully funded by this
program, while only partial support was provided in the case of each of the
remaining publications. Note that each thesis is included in its entirety as an
attachment to this document.


Graduate  Student  Theses 

Baird, L. (1991). Learning and Adaptive Hybrid Systems for Nonlinear Control,
CSDL Report T-1099, M.S. Thesis, Department of Computer Science,
Northeastern University. [Attachment 1]

Nistler,  N.  (1992).  A  Learning  Enhanced  Flight  Control  System  for  High 
Performance Aircraft, CSDL Report T-1127, M.S. Thesis, Department of
Aeronautics  and  Astronautics,  M.I.T.  [Attachment  2] 

Atkins,  S.  (1993).  Incremental  Synthesis  of  Optimal  Control  Laws  Using 
Learning Algorithms, CSDL Report T-1181, M.S. Thesis, Department of
Aeronautics  and  Astronautics,  M.I.T.  [Attachment  3] 


Conference  Papers 

Baird, L. & Baker, W. (1990). "A Connectionist Learning System for Nonlinear
Control," proceedings, 1990 AIAA Conference on Guidance, Navigation, and
Control.

Baker, W. & Farrell, J. (1990). "Connectionist Learning Systems for Control,"
proceedings, SPIE OE/Boston '90.

Farrell, J. & Baker, W. (1991). "Learning Augmented Control for Advanced
Autonomous Underwater Vehicles," proceedings, 18th Annual AUVS
Technical Symposium and Exhibit.

Alexander, J., Baird, L., Baker, W., & Farrell, J. (1991). "A Design & Simulation
Tool for Connectionist Learning Control Systems: Application to Autonomous
Underwater Vehicles," proceedings, 1991 SCS Summer Computer Simulation
Conference.

Baker, W. & Farrell, J. (1991). "Learning Augmented Flight Control for High
Performance Aircraft," proceedings, 1991 AIAA Conference on Guidance,
Navigation, and Control.

Vos, D., Baker, W., & Millington, P. (1991). "Learning Augmented Gain
Scheduling Control," proceedings, 1991 AIAA Conference on Guidance,
Navigation, and Control.

Baker, W. & Millington, P. (1992). "Adaptation & Learning in Control Systems,
Application to Flight Control," proceedings, 1992 Government Neural Network
Applications Workshop.

Millington, P., Baker, W., & Koenig, M. (1993). "Control Augmentation System
(CAS) Synthesis via Adaptation & Learning," proceedings, 1993 AIAA
Conference on Guidance, Navigation, and Control.


Book Chapters

Baker, W. & Farrell, J. (1992). "An Introduction to Connectionist Learning
Control Systems," in White, D. & Sofge, D., eds., Handbook of Intelligent
Control: Neural, Fuzzy, and Adaptive Approaches, Van Nostrand Reinhold.

Farrell,  J.  &  Baker,  W.  (1993).  "Learning  Control  Systems,"  in  Antsaklis,  P.  & 
Passino,  K.,  eds.,  Intelligent  and  Autonomous  Control  Systems,  Kluwer 
Academic. 

Farrell, J. & Baker, W. "Learning Control Systems: Motivation and
Implementation," to appear in Intelligent Control Systems: Theory and
Practice, Gupta, M. & Sinha, N., eds., IEEE Press.


2 Drive-Reinforcement Learning


This chapter addresses those aspects of the program that were primarily concerned
with an investigation of reinforcement learning methods and their application
to problems in automatic control. The initial focus of this part of the investigation
was on the drive-reinforcement (D-R) learning paradigm [Klopf (1988)];
later on, the scope was expanded to include associative control process (ACP) networks
[Klopf, Morgan, & Weaver (1992); Baird & Klopf (1992)]. Conventional alternatives
to learning control were also reviewed. In addition, a reinforcement
learning approach based on the use of the associative search element / adaptive critic
element [Barto, Sutton, & Anderson (1983)] was examined. Experimental results
were  obtained  by  applying  some  of  these  approaches  to  a  cart-pole  stabilization 
and  tracking  problem.  The  use  of  reinforcement  learning  and  related  methods 
(e.g.,  ACP  networks  [Klopf,  Morgan,  &  Weaver  (1992)],  temporal  difference  meth¬ 
ods  [Sutton  (1988)],  and  Q  learning  [Watkins  (1989)])  in  the  context  of  optimal  con¬ 
trol  was  also  examined. 

2.1 Initial Work

The first part of our investigation of the drive-reinforcement learning paradigm
amounted to a review of the relevant technical literature on the subject, preliminary
theoretical analysis of the algorithm, and an experimental study of the behavior
of the algorithm via a computer simulation we developed. As a check of
the validity of this software simulation, we successfully duplicated every experimental
result described in [Klopf (1988)]. The motivation and development of the
D-R learning paradigm as presented in [Klopf (1988)] is quite lucid and has no
worthy substitute; the interested reader is strongly encouraged to examine this
reference, as well as [Klopf, Morgan, & Weaver (1992)], before considering the rest
of this chapter. Thus, we will not provide a summary of these works per se, although
some background information may be found in Attachment 3.

2.1.1 Network Model

Early on in our investigation, it seemed likely that a network of
drive-reinforcement learning neurons would be a useful and perhaps even necessary
development for general applications of this paradigm to problems in automatic
control. To this end, we set out to develop a suitably general dynamic network
model, with the following objectives in mind:

•  develop  a  general  model  of  a  D-R  learning  network 

•  emphasize  adaptive/learning  control  application 

•  maintain fidelity to Klopf's (single) neuronal model

The drive-reinforcement learning algorithm defined in [Klopf (1988)] really only
pertains to the behavior of a single neuron; it is not immediately clear how one
should generalize such a model to describe the behavior of a network of interacting
drive-reinforcement neurons.

A  number  of  interesting  design  issues  arise  when  one  contemplates  the  extension 
of  a  single  D-R  neuron  to  a  network  of  many  such  units.  Some  of  these  network 
design  issues  are  listed  below: 

•  number  of  neurons  (size  of  network) 

•  primary  drive  selection 

•  "potential"  (acquired)  drive  selection 

•  network  inputs 

-  binary 

-  real-valued 

•  parameters 

-  learning  rate  coefficients 

-  learning  interval 

The proposed network model outlined below appears satisfactory for two important
reasons: (i) in the special case where the network consists of exactly one
neuron, the network model corresponds exactly with [Klopf (1988)] and (ii) the
structure of the network model strongly resembles the structure of an adaptive
controller implemented as a system of nonlinear difference equations.

Network  Drive  Equations 

The network drive equations provide a mathematical description of the input-output
behavior of a drive-reinforcement learning network. These two equations are
modeled after the usual state-space representation of a dynamic system. The first
equation introduces the concept of state [the vector $x(k)$] to drive-reinforcement
learning. This equation describes the causal relationship between the current
state of the network $x(k)$ and its previous state $x(k-1)$ and current input stimuli
$u_{cs}(k)$ and $u_{us}(k)$. The second equation simply describes the (linear) mapping between
the current state of the network $x(k)$ and its outputs $y(k)$.

$$x(k) = f_N\left[\{A^{+}(k) + A^{-}(k) + A^{0}\}\,x(k-1) + \{B^{+}(k) + B^{-}(k)\}\,u_{cs}(k) + B^{0}\,u_{us}(k)\right]$$

$$y(k) = C^{0}\,x(k)$$

Note the role that each matrix plays in determining the input-output behavior of
the drive-reinforcement learning network. The matrices $A^{+}(k)$, $A^{-}(k)$, and $A^{0}$
characterize those interactions that are wholly internal; that is, those interactions
that occur among the neurons of the network. The matrices $B^{+}(k)$, $B^{-}(k)$, and $B^{0}$
map the inputs to the network state, while the matrix $C^{0}$ maps the network state
to the outputs.

The elements of the matrices $A^{+}(k)$, $A^{-}(k)$, $B^{+}(k)$, and $B^{-}(k)$ represent plastic
synaptic efficacies that are constrained to be strictly and exclusively excitatory,
inhibitory, or nonactive; i.e.,

$$\{a^{+}_{ij}(k) \ge w_{min} \text{ and } a^{-}_{ij}(k) \le -w_{min}\} \text{ or } \{a^{+}_{ij}(k) = a^{-}_{ij}(k) = 0\} \text{ for all } i, j$$

$$\{b^{+}_{ij}(k) \ge w_{min} \text{ and } b^{-}_{ij}(k) \le -w_{min}\} \text{ or } \{b^{+}_{ij}(k) = b^{-}_{ij}(k) = 0\} \text{ for all } i, j$$

where $w_{min}$ is the minimum allowable absolute value for all active plastic excitatory
and inhibitory synaptic efficacies. The matrices $A^{0}$, $B^{0}$, and $C^{0}$ are all constant;
the elements of these matrices may be negative, positive, or zero.

The vector-valued function $f_N(x)$ is defined below:

$$f_N(x) = [f_N(x_1), \ldots, f_N(x_n)]^{T}$$

where $f_N(\cdot)$ is the neuronal output function shown in Fig. 2.1 and the symbol $T$
denotes the matrix (or vector) transpose operation.
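As a rough illustration of how the drive equations could be stepped in software, the following sketch evaluates one update of the network state and output. It is not the simulation used in this program. The clipping form of `f_N` (limits 0 and 1) is an assumption standing in for the curve of Fig. 2.1, and the helper names `matvec`, `madd`, and `drive_step` are hypothetical.

```python
def f_N(v, lo=0.0, hi=1.0):
    # Assumed neuronal output function: clip to [lo, hi].
    # The actual shape is given by Fig. 2.1, which is not reproduced here.
    return min(max(v, lo), hi)

def matvec(M, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def madd(*Ms):
    # Elementwise sum of equally sized matrices.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*Ms)]

def drive_step(x_prev, u_cs, u_us, A_plus, A_minus, A0, B_plus, B_minus, B0, C0):
    # x(k) = f_N[{A+ + A- + A0} x(k-1) + {B+ + B-} u_cs(k) + B0 u_us(k)]
    A = madd(A_plus, A_minus, A0)
    B = madd(B_plus, B_minus)
    net = [a + b + c for a, b, c in
           zip(matvec(A, x_prev), matvec(B, u_cs), matvec(B0, u_us))]
    x = [f_N(v) for v in net]
    # y(k) = C0 x(k)
    y = matvec(C0, x)
    return x, y

# Single neuron, two conditionable stimuli, one unconditionable stimulus
# (all numerical values are illustrative only).
x, y = drive_step(
    x_prev=[0.0], u_cs=[1.0, 0.0], u_us=[0.5],
    A_plus=[[0.0]], A_minus=[[0.0]], A0=[[0.0]],
    B_plus=[[0.1, 0.1]], B_minus=[[-0.1, -0.1]], B0=[[1.0]], C0=[[1.0]],
)
```

In this example the excitatory and inhibitory efficacies cancel, so the output is driven entirely by the unconditionable stimulus through the constant matrix `B0`.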

Network  Reinforcement  Equations 

The network reinforcement equations describe the adaptive behavior of a drive-reinforcement
learning network. The first pair of equations account for adaptation
due to external reinforcement, while the second pair account for internal (interneuron
or intranetwork) reinforcement.
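$$b^{+}_{ij}(k+1) = f_{w^{+}}\left[ b^{+}_{ij}(k) + \{x_i(k) - x_i(k-1)\} \sum_{l=1}^{\tau} \alpha_l\, b^{+}_{ij}(k-l)\, f_s[u_{cs,j}(k-l) - u_{cs,j}(k-l-1)] \right]$$

$$b^{-}_{ij}(k+1) = f_{w^{-}}\left[ b^{-}_{ij}(k) - \{x_i(k) - x_i(k-1)\} \sum_{l=1}^{\tau} \alpha_l\, b^{-}_{ij}(k-l)\, f_s[u_{cs,j}(k-l) - u_{cs,j}(k-l-1)] \right]$$

$$a^{+}_{ij}(k+1) = f_{w^{+}}\left[ a^{+}_{ij}(k) + \{x_i(k) - x_i(k-1)\} \sum_{l=1}^{\tau} \alpha_l\, a^{+}_{ij}(k-l)\, f_s[x_j(k-l) - x_j(k-l-1)] \right]$$

$$a^{-}_{ij}(k+1) = f_{w^{-}}\left[ a^{-}_{ij}(k) - \{x_i(k) - x_i(k-1)\} \sum_{l=1}^{\tau} \alpha_l\, a^{-}_{ij}(k-l)\, f_s[x_j(k-l) - x_j(k-l-1)] \right]$$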

The functions $f_s(\cdot)$, $f_{w^{+}}(\cdot)$, and $f_{w^{-}}(\cdot)$ are shown in Figs. 2.2, 2.3, and 2.4,
respectively. Note that $f_{w^{-}}(x) = -f_{w^{+}}(-x)$.
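The following sketch shows one way a single plastic efficacy might be updated under these reinforcement equations. It is a simplified reading, not the program's implementation. The shapes of `f_s`, `f_w_plus`, and `f_w_minus` are assumptions standing in for Figs. 2.2 to 2.4 (positive part, lower bound at `W_MIN`, and the mirrored upper bound at `-W_MIN`). The past weight enters through its magnitude so that one formula serves both excitatory and inhibitory weights. The values of `W_MIN` and the learning rate coefficients are illustrative, and the synapse is assumed to be active (nonzero).

```python
W_MIN = 0.1  # assumed minimum magnitude of an active plastic efficacy

def f_s(du):
    # Assumed reinforcement stimuli function: only positive changes count.
    return max(0.0, du)

def f_w_plus(w):
    # Assumed excitatory adaptation function: keep the weight at or above W_MIN.
    return max(w, W_MIN)

def f_w_minus(w):
    # Mirror image of the excitatory function: f_w-(x) = -f_w+(-x).
    return -f_w_plus(-w)

def dr_update(w_hist, d_out, du_hist, alphas, excitatory=True):
    """Return the weight at time k+1.

    w_hist[l]  : weight at time k-l (index 0 is the current weight)
    d_out      : change in the postsynaptic signal, x_i(k) - x_i(k-1)
    du_hist[l] : change in the presynaptic signal at lag l, for l >= 1
                 (index 0 is unused)
    alphas     : learning rate coefficients alpha_1 ... alpha_tau
    """
    corr = sum(a * abs(w_hist[l]) * f_s(du_hist[l])
               for l, a in enumerate(alphas, start=1))
    w_new = w_hist[0] + d_out * corr
    return f_w_plus(w_new) if excitatory else f_w_minus(w_new)

# Illustrative values: tau = 2, a rise in the input one step ago followed
# by a rise in the output now strengthens the excitatory weight.
w_exc = dr_update([1.0, 1.0, 1.0], 0.1, [0.0, 0.2, 0.0], [5.0, 3.0])
w_inh = dr_update([-1.0, -1.0, -1.0], 0.1, [0.0, 0.2, 0.0], [5.0, 3.0],
                  excitatory=False)
```

Under these assumptions the same positive input-output correlation raises the excitatory weight and moves the inhibitory weight toward zero, while the bounding functions keep each weight on its own side of the dead zone.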

Single  D-R  Neuron 

In  the  special  case  where  the  network  consists  of  exactly  one  neuron,  the  network 
model  is  mathematically  equivalent  to  the  refined  drive-reinforcement  learning 
model for a single neuron, as described in [Klopf (1988)]:

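$$y(k) = f_N\left[\{b^{+}(k) + b^{-}(k)\}^{T} u_{cs}(k) + b^{0}\,u_{us}(k)\right]$$

$$b^{+}_{i}(k+1) = f_{w^{+}}\left[ b^{+}_{i}(k) + \{y(k) - y(k-1)\} \sum_{l=1}^{\tau} \alpha_l\, b^{+}_{i}(k-l)\, f_s[u_{cs,i}(k-l) - u_{cs,i}(k-l-1)] \right]$$

$$b^{-}_{i}(k+1) = f_{w^{-}}\left[ b^{-}_{i}(k) - \{y(k) - y(k-1)\} \sum_{l=1}^{\tau} \alpha_l\, b^{-}_{i}(k-l)\, f_s[u_{cs,i}(k-l) - u_{cs,i}(k-l-1)] \right]$$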

where it is assumed that the output of the neuron is simply its state [$c^{0} = 1$], that
there are no self-loops [$a^{+}(k) = a^{-}(k) = 0$], and that the neuron has only one
unconditionable stimulus [$n_{us} = 1$]. With the following change of variables, these
equations may be transformed into the equations governing the behavior of a single
drive-reinforcement neuron (see [Klopf (1988)]):

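$$b^{+}_{i}(k) \Rightarrow w_{2i-1}(k), \quad b^{-}_{i}(k) \Rightarrow w_{2i}(k), \quad u_{cs,i}(k) \Rightarrow x_i(k) \qquad \text{for } 1 \le i \le n_{cs}$$

$$b^{0} \Rightarrow w_{2i-1}(k), \quad u_{us}(k) \Rightarrow x_i(k) \qquad \text{for } i = n_{cs} + 1$$

$$\alpha_l \Rightarrow c_l \qquad \text{for } 1 \le l \le \tau$$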
Fig. 2.1. Neuronal Output Function.
Fig. 2.2. Reinforcement Stimuli Function.
Fig. 2.3. Excitatory Adaptation Function.
Fig. 2.4. Inhibitory Adaptation Function.


The network model of drive-reinforcement learning outlined above is quite general.
Few restrictions have been made regarding the topology of the network; for
example, self-loops have not been disallowed, nor has the influence of multiple
unconditionable stimuli on the same neuron been ruled out. Such constraints
might prove helpful in the refinement of the network model. If, for instance,
drive-reinforcement neurons are not allowed to have self-loops, then each element
on the main diagonal of the three matrices $A^{+}(k)$, $A^{-}(k)$, and $A^{0}$ would necessarily
have to be zero. Similarly, if a neuron is allowed to have at most a single
unconditionable stimulus, then each row of the matrix $B^{0}$ would have at most a single
nonzero element.
2.1.2 Classical Conditioning Experiments

As  mentioned  above,  a  software  simulation  of  these  network  equations  was  devel¬ 
oped  and  implemented  on  a  personal  computer.  The  goal  was  to  develop  a  tool 
with  which  we  could  easily  explore  the  nuances  of  the  algorithm.  The  special 
case of a network with only a single neuron was used to duplicate all of the classical
conditioning experiments described in [Klopf (1988)]. The results we obtained
were  identical  to  those  reported  in  this  reference  and,  hence,  will  not  be  repeated 
below. 


2.1.3  Analysis 


In  an  attempt  to  better  understand  the  D-R  learning  algorithm,  we  examined  a 
number  of  its  attributes,  particularly  in  the  special  case  of  a  single  neuron.  As  a 
reminder,  we  provide  the  weight  update  equations  for  a  single  neuron  below: 


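$$w^{\pm}_{i}(k+1) = f_{w^{\pm}}\left[ w^{\pm}_{i}(k) + \Delta y(k) \sum_{j=1}^{\tau} \alpha_j\, |w^{\pm}_{i}(k-j)|\, f_s[\Delta u_i(k-j)] \right]$$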


The equilibrium conditions of the input-output behavior of a single D-R neuron, as
well  as  the  equilibrium  conditions  associated  with  its  adjustable  weights,  were 
examined  under  classical  conditioning  (open-loop)  experiments.  The  conditions 
for  zero  weight  change  are  shown  below: 

1.  $w^{+}(k) = w_{min}$ and $\Delta y(k) < 0$

    $w^{-}(k) = -w_{min}$ and $\Delta y(k) > 0$

2.  $\Delta y(k) = 0$

3.  $\Delta u_i(k-j) \le 0$ for $1 \le j \le \tau$

2.2 Further Work

After having gained a basic understanding of the drive-reinforcement learning
paradigm under mostly open-loop conditions, we began to investigate its behavior
under closed-loop conditions. We were particularly interested in its convergence
and stability properties under feedback.

Simple  System  Dynamics 

Prior  to  the  application  of  the  D-R  learning  algorithm  to  the  problem  of  controlling 
the  cart-pole  system,  we  elected  to  examine  its  performance  relative  to  a  simpler 
(two-dimensional)  control  problem  with  related  dynamics.  In  particular,  the 
"simple  system"  dynamics  were: 

•  linear 

•  open-loop  unstable 

•  nonminimum  phase 

•  2  state  variables 

•  open-loop  transfer  function: 

$$\frac{Y(s)}{U(s)} = \frac{s - 3.8}{s\,(s - 4.0)}$$

A variety of different controller configurations were explored, as were modifications
to the basic D-R neuronal model to accommodate bipolar outputs. Initial
work with this system showed promise, so we moved on to the full cart-pole control
problem.

Cart-Pole  System  Dynamics 

Nominal  cart-pole  system  dynamics: 

•  nonlinear 

•  open-loop  unstable 

•  nonminimum  phase 

•  4 state variables: {x, θ, v, ω}

•  open-loop  transfer  function  for  linearized  model: 

$$\frac{Y(s)}{F(s)} = \frac{(s - 3.8360)(s + 3.8360)}{s^{2}(s - 3.9739)(s + 3.9739)}$$

•  open-loop poles and zeros in complex plane:

[Figure: pole-zero plot; real axis marked from -4.0 to 4.0]
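The properties listed above can be read directly from the factored transfer function of the linearized model. The short sketch below does this check in code; the pole and zero values are taken from the factors quoted in the text, while the helper name `has_rhp` is hypothetical.

```python
# Zeros and poles of the linearized cart-pole model, read from the
# factored open-loop transfer function.
zeros = [3.8360, -3.8360]
poles = [0.0, 0.0, 3.9739, -3.9739]

def has_rhp(roots):
    # True if any root lies strictly in the right half of the complex plane.
    return any(complex(r).real > 0.0 for r in roots)

# A right-half-plane pole makes the open-loop system unstable.
# (The double pole at the origin is a further source of unbounded response,
# but it is not needed to reach this conclusion.)
open_loop_unstable = has_rhp(poles)

# A right-half-plane zero makes the system nonminimum phase.
nonminimum_phase = has_rhp(zeros)

# The number of poles matches the four state variables of the model.
state_order = len(poles)
```

The same check applied to the "simple system" (zero at 3.8, poles at 0 and 4.0) gives the same two properties with two state variables, which is why that system served as a stepping stone.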
Under certain ideal conditions, a steady motion of the cart-pole system can be
achieved:

•  no noise or disturbances

•  pole angle is constant and nonzero

•  applied  force  is  constant  and  nonzero 

•  pole  angle,  applied  force,  cart  velocity,  and  cart  acceleration  all  have  same 
sign 


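Steady motion ensues if

$$\theta = \tan^{-1}\left[\frac{f - \mu_c}{g(m_c + m_p)}\right] \quad \text{for } \theta, \dot{x} > 0 \text{ and } f > \mu_c$$

$$\theta = \tan^{-1}\left[\frac{f + \mu_c}{g(m_c + m_p)}\right] \quad \text{for } \theta, \dot{x} < 0 \text{ and } f < -\mu_c$$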
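As a numerical aid, the sketch below evaluates the steady pole angle for a given constant applied force. It rests on one reading of the steady-motion condition: the pole leans at the angle whose tangent is the common cart-pole acceleration divided by g, with the friction term `mu_c` acting as a constant force opposing cart motion. The function name, the default value of `g`, and the mass values in the example are assumptions, not parameters taken from this program.

```python
import math

def steady_pole_angle(f, mu_c, m_c, m_p, g=9.81):
    """Pole angle (rad) for steady motion, or None if the force cannot
    overcome the friction term.

    f    : constant applied force
    mu_c : friction force opposing cart motion (assumed constant)
    m_c  : cart mass
    m_p  : pole mass
    """
    if f > mu_c:
        # Motion in the positive direction: friction subtracts from f.
        return math.atan((f - mu_c) / (g * (m_c + m_p)))
    if f < -mu_c:
        # Motion in the negative direction: friction adds to (negative) f.
        return math.atan((f + mu_c) / (g * (m_c + m_p)))
    return None

# Hypothetical masses; equal and opposite forces give mirrored angles.
angle_pos = steady_pole_angle(1.0, 0.0, 1.0, 0.1)
angle_neg = steady_pole_angle(-1.0, 0.0, 1.0, 0.1)
```

The sign structure matches the listed conditions: angle, force, velocity, and acceleration all share one sign, and no steady motion exists when the force magnitude does not exceed the friction term.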


The  steady  motion  properties  are  important  because  they  relate  to  pole-balancing 
experiments  performed  at  Wright  Laboratory. 

WL  Pole-Balancing  Experiments 

We were also fortunate enough to have access to a set of pole-balancing experiments
(i.e., without restrictions on cart position) that were generated by researchers
in the Avionics Directorate at the USAF Wright Laboratory. The basic
scenario for this problem is outlined below:

•  pole-balancing  only 

•  infinite  track 

•  failure if absolute value of pole angle exceeds 12 deg

•  pole  angle  and  angular  rate  information  only 

•  sampling  rate  of  50  Hz 

•  response  to  initial  10  N  disturbance  from  state-space  origin 
The WL pole-balancing results were successfully duplicated by our software simulation.
However, in the course of this work, we obtained a number of interesting
results that were not expected. These results were found, in part, as we sought to
answer the following questions:

•  what  has  the  controller  actually  "learned"? 

•  is  the  acquired  steady  condition  stable? 

•  what  if  the  state-space  quantization  is  refined? 

Some  of  these  issues  are  discussed  in  Section  2.3. 

Chaotic  Behavior 

In  the  course  of  evaluating  the  pole-balancing  results,  we  detected  a  discrep¬ 
ancy  between  software  simulations  run  on  different  computers.  One  machine 
produced a very small round-off error (on the order of $10^{-16}$) relative to the same
(single)  calculation  performed  on  a  second  machine.  Ordinarily,  such  a  small 
numerical error is of no consequence; however, if the simulation is numerically
unstable  (as  is  the  WL  pole-balancing  experiment,  since  cart  position  and  velocity 
go  to  infinity),  then  even  very  small  errors  tend  to  grow  to  significant  levels.  This 
numerical  instability  combined  with  possible  bifurcations  due  to  different  trajec¬ 
tories  through  the  quantized  state-space,  results  in  extreme  sensitivity  to  initial 
conditions  and/or  round-off  errors  (i.e.,  chaotic  behavior).  The  upshot  of  this  is 
that simulations of the WL pole-balancing experiment run on different machines
can  produce  very  different  results. 
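The growth of a round-off-sized discrepancy in an unstable simulation can be illustrated with a very small sketch. It covers only the exponential growth mechanism, not the bifurcations caused by the quantized state-space. The initial error of 1e-16, the growth rate of 4.0 per second, and the function name are assumptions chosen for illustration; the 0.02 s step corresponds to the 50 Hz sampling rate of the experiment.

```python
import math

def error_growth(e0, rate, dt, steps):
    """Propagate an initial discrepancy e0 through an unstable first-order
    mode with continuous-time growth rate `rate`, sampled every dt seconds."""
    a = math.exp(rate * dt)  # per-step amplification factor (> 1 if unstable)
    e = e0
    for _ in range(steps):
        e *= a
    return e

# Ten simulated seconds at 50 Hz.
final_error = error_growth(e0=1e-16, rate=4.0, dt=0.02, steps=500)
```

With these assumed numbers the discrepancy is amplified by a factor of exp(40), so an error far below any physical significance at the start is no longer negligible after a few seconds of simulated time.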

Pole- Balancing  Experiments  Summary 

Experiments performed:

•  nominal 

•  perturbed 

•  perturbed- 

•  random initial disturbances

•  refined  state-space  quantization 

•  perturbed  after  steady  motion  achieved 

Results: 
•  steady motion possible

•  unstable  dynamics 

•  small  errors  grow  to  significant  levels 

•  trajectory  bifurcations  due  to  quantized  state-space 

•  simulation is extremely sensitive to initial conditions and/or round-off errors

Elements of the solution:

•  stable  limit  cycle  about  vertical  must  be  attained 

•  control  level  in  outer  bins  must  provide  sufficient  restoring  force 

•  cart  position  control  may  be  subsequently  achieved  by  biasing  sensed  pole 
angle 

2.3 Refinement of D-R Network Equations

One  way  for  a  learning  system  to  solve  a  control  problem  is  by  learning  the  appro¬ 
priate  feedback  gains  for  the  state  variable  and  command  inputs.  Taken  one  step 
further,  the  network  could  also  multiply  these  inputs  by  the  gains  it  has  learned, 
and  then  output  the  answer  as  the  appropriate  control  action.  Under  these  condi¬ 
tions,  the  input-output  behavior  of  the  network  is  essentially  equivalent  to  that  of  a 
gain  vector.  The  main  difficulty  encountered  in  applying  a  D-R  learning  network 
to  this  problem  rests  firmly  with  the  fact  that  such  a  network  is  incapable  of 
maintaining  the  feedback  gains  it  has  learned  as  the  control  system  is  exercised.^ 
Alternatively,  if  the  network  attempts  to  learn  the  appropriate  control  actions  di¬ 
rectly  (as  a  function  of  the  state  and  input  commands),  the  same  difficulty  re¬ 
mains. 

Another  way  to  state  the  basic  underlying  problem  is  as  follows:  in  general,  a  D-R 
network  that  is  used  for  feedback  control  will  always  be  exposed  to  time-varying 
inputs  (e.g.,  state  variables,  input  commands,  and  performance  measurements), 
and  will  always  be  expected  to  provide  time-varying  outputs  (e.g.,  actuator  com¬ 
mands),  in  accordance  with  the  normal  operation  of  the  control  system.  In  turn, 


^  The word "exercised" is used to describe the normal active use of a control system in which
time-varying input commands are specified. For example, a control system might be used to
position a cart-pole object on a flat horizontal track; if a time-varying command signal is used
(e.g., the controller is told to move the cart-pole object back and forth between two fixed points
on the track), then the control system is being exercised.
this implies that the neurons within such a network will also experience time-varying
inputs and outputs. Under such conditions, the synaptic weights associated
with these neurons will also vary as a direct consequence of the D-R learning algorithm.
Thus, if one assumes that some D-R network has "learned" to behave as
the desired controller at time t (which implies that it has acquired a suitable, but
perhaps not unique, set of weights $w(t) = \{w_1, \ldots, w_n\}$), then simply because any
of the network inputs and outputs varies in the course of normal operation (i.e.,
under conditions in which no learning is required), some network weights will
change as well. As a result, the new network weights will no longer correspond
to the set of weights associated with the controller that previously had "learned"
the desired control law (i.e., $w(t) \ne w(t+1)$, in general). Our conclusion is that a
network of D-R learning neurons (as narrowly defined in [Klopf (1988)]) suffers
from an instability problem when used under closed-loop conditions, in the sense
that the weights do not appear to have any stable equilibrium configurations when
the closed-loop system is exercised. No "reasonable" network structures have
been identified which overcome this problem.

2.3.1  Multiplication  Learning  Problem 

A  very  simple  closed-loop  control  problem  that  can  be  used  to  illustrate  some  of  the 
foregoing  concepts  is  based  on  a  network  that  takes  one  input  and  learns  to 
provide  a  single  output  that  is  the  input  times  a  constant  gain  k.  This  highly 
simplified  control  problem  will  be  called  the  multiplication  learning  problem. 

To help the network learn the desired gain k, it will have access to feedback signals
that give it clues concerning its current behavior. In this problem, we assume
that the network has access to a single feedback signal that tells it whether
its output is too high or too low, and by how much. This signal will be called the
error signal.

Clearly,  for  some  learning  algorithms  this  problem  can  be  solved  by  a  network 
comprised  of  a  single  plastic  weight:  the  input  is  multiplied  by  the  weight  to  give 
the  output,  and  the  weight  is  adjusted  proportional  to  the  error  signal.  In  fact,  a 
weight  update  algorithm  can  be  easily  derived  that  allows  "deadbeat"  control 
(learning occurs in a single time step). Performance at this level will not be required,
however; it will be acceptable if the plastic weight approaches k asymptotically,
as it would if a gradient learning algorithm were used. It would even be
acceptable if the weight fluctuated around the correct value, as long as it stayed
"in its neighborhood." Thus, the problem constraints have been relaxed somewhat
by requiring that the network eventually learn to maintain its output within
some prescribed range of the desired output.
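For reference, the two single-weight solutions mentioned above (a gradient update and a "deadbeat" update) can be sketched as follows. This is a generic illustration, not a D-R network. The error signal is taken here as desired output minus actual output, and the function names, the step size `eta`, and the input sequence are assumptions.

```python
def error_signal(w, x, k_gain):
    # Supplied to the learner from outside: how far the output w*x is from
    # the desired output k*x (positive means the output is too low).
    return k_gain * x - w * x

def gradient_step(w, x, error, eta):
    # Gradient (LMS-type) update: adjust the weight in proportion to the
    # error signal, scaled by the input.
    return w + eta * error * x

def deadbeat_step(w, x, error):
    # Deadbeat update: remove the whole error in one step (requires x != 0).
    return w + error / x

K_GAIN = 3.0

# Deadbeat learning reaches the gain after a single nonzero input.
w_deadbeat = deadbeat_step(0.0, 2.0, error_signal(0.0, 2.0, K_GAIN))

# Gradient learning approaches the gain asymptotically while the input
# keeps changing; the weight then stays put because the error vanishes.
w_gradient = 0.0
inputs = [1.0, -0.5, 2.0] * 70
for x in inputs:
    w_gradient = gradient_step(w_gradient, x,
                               error_signal(w_gradient, x, K_GAIN), eta=0.1)
```

The relevant point for what follows is that both updates are driven by the error signal, so the weight stops changing once the output is correct, no matter how the input varies.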

Some  potential  approaches  for  solving  the  multiplication  learning  problem  involve 
feedback loops within the network. In such cases, it may take a significant
amount of time for an output to be generated by a given input. To make the problem
easier for the learning network, it will also be assumed that the input only
changes at periodic intervals, and that the output associated with each input
must be calculated before the end of an input interval. The network designer in
this problem is free to create a network of any finite size, with any finite number of
plastic weights, and may also assume that the input changes at any desired finite
periodic rate. These rules give the designer a great deal of freedom, and should
make the problem as easy as is possible without radically changing its nature.

This  problem  is  a  simplified  version  of  what  a  realistic  control  system  should  be 
able  to  accommodate  and  is,  therefore,  probably  something  that  any  learning  con¬ 
trol  system  should  be  able  to  accommodate.  D-R  networks  appear  to  be  unable  to 
solve this problem, even if large hierarchical networks are used. In general, this
is because the solution k must be stored in a plastic weight somewhere in the
network, and must periodically have the input signal applied to it. Since the input
signal may change several times within one τ period, input-output correlations
will  occur  that  cause  the  weight  to  change,  even  when  it  is  already  at  the  correct 
value  of  k.  Stable  equilibrium  weight  configurations  are  essentially  unobtainable 
in  a  D-R  learning  network  subject  to  closed-loop  conditions. 
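The weight drift described above can be reproduced in a deliberately stripped-down sketch: a single excitatory weight that already holds the correct gain, an output equal to weight times input (the output function is assumed to operate in its linear region), and a D-R style update in which the change in output is correlated with earlier positive changes in the input. The learning rate coefficient, the lower bound `w_min`, the input sequence, and the function names are all illustrative assumptions rather than values from our experiments.

```python
def f_s(du):
    # Assumed reinforcement stimuli function: only positive changes count.
    return max(0.0, du)

def simulate_drift(k_gain, inputs, alphas, w_min=0.1):
    """Start with the weight equal to the correct gain and apply a
    D-R style update while the input varies. Returns the final weight."""
    tau = len(alphas)
    w_hist = [k_gain] * (tau + 1)      # w_hist[l] = weight at time k-l
    u_hist = [inputs[0]] * (tau + 2)   # u_hist[l] = input at time k-l
    y_prev = k_gain * inputs[0]
    for u in inputs[1:]:
        u_hist = [u] + u_hist[:-1]
        y = w_hist[0] * u              # output in the assumed linear region
        dy = y - y_prev
        # Correlate the output change with earlier positive input changes.
        corr = sum(a * abs(w_hist[l]) * f_s(u_hist[l] - u_hist[l + 1])
                   for l, a in enumerate(alphas, start=1))
        w_new = max(w_hist[0] + dy * corr, w_min)
        w_hist = [w_new] + w_hist[:-1]
        y_prev = y
    return w_hist[0]

# The input ramps up and back down once; the weight starts at the correct gain.
w_after_ramp = simulate_drift(2.0, [0.0, 1.0, 2.0, 1.0, 0.0], alphas=[0.05])

# With a constant input there is no output change and hence no drift.
w_after_hold = simulate_drift(2.0, [1.0] * 5, alphas=[0.05])
```

In this sketch nothing in the update refers to an error signal, so a correct weight is not a resting point: any exercise of the input produces correlated input and output changes that move the weight away from k.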

2.3.2 Facilitatory Learning

Facilitatory learning (and/or other ancillary learning mechanisms) may complement
the basic D-R learning algorithm in such a way as to allow D-R learning
to be successfully applied to feedback control problems. We have some (untested)
ideas on this subject.

Weight  update  equations  for  a  single  modified  neuron: 
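$$w^{+}_{i}(k+1) = f_{w^{+}}\left[ w^{+}_{i}(k) + \Delta u_f(k) \sum_{j=1}^{\tau} \alpha_j\, |w^{+}_{i}(k-j)|\, f_s[\Delta u_i(k-j)] \right]$$

$$w^{-}_{i}(k+1) = f_{w^{-}}\left[ w^{-}_{i}(k) + \Delta u_f(k) \sum_{j=1}^{\tau} \alpha_j\, |w^{-}_{i}(k-j)|\, f_s[\Delta u_i(k-j)] \right]$$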

•  weight  change  mediated  by  an  otherwise  inert  stimulus 

•  self-loop  (without  delay  ...)  corresponds  to  unmodified  D-R  neuron 


These  proposed  modifications  are  similar  in  effect  to  subsequent  refinements  of 
the  drive-reinforcement  paradigm  made  by  its  developers,  particularly  in  the 
special case where it is embedded in an associative control process network (e.g.,
see [Klopf, Morgan, & Weaver (1992)]). As we later discovered, this refinement does
allow  the  combined  D-R/ACP  approach  to  be  successfully  applied  to  optimal 
control  problems — this  work  is  fully  described  in  Attachment  3. 


2.4 Alternative Strategies for Learning Control


In  preparation  for  the  comparative  evaluation  phase  of  the  program,  we  consid¬ 
ered  alternative  approaches,  including  conventional  adaptive  control  and  Barto- 
Sutton  learning  control.  In  addition,  we  explored  key  issues  related  to  control  sys¬ 
tem  configuration  and  control  system  performance.  These  issues  have  not  been 
fully  resolved. 


2.4.1 Performance Measures & Design Issues

The  key  performance  measures  and  design  issues  that  we  considered  in  our 
evaluation  are  summarized  below. 


Performance  measures: 

•  stability 

•  transient  response 

-  settling time

-  overshoot 

•  steady-state  behavior 

-  tracking error

-  limit  cycles 
•  robustness to modeling uncertainty

•  sensitivity  to  noise  and  disturbances 

•  control  action  (energy,  power) 

•  adaptation  time 

•  optimality  (local  vs.  global) 

Other  issues: 

•  controller  design  difficulty 

-  structure

-  parameters 

-  training  process 

•  information  requirements 

-  a  priori  (design  and  training) 

-  real-time  (measurements) 

•  implementation

-  processing  requirements 

-  storage  requirements 

-  sequential  vs.  parallel 

-  cycle  time 

•  scale-up to more difficult problems

2.4.2 Alternative Approaches

A number of alternative approaches to drive-reinforcement learning were considered
as candidates for subsequent application to the flight control problem. Several
conventional approaches were included to provide a
stronger basis for comparison. The results of this comparison are summarized
below.

Linear  State  Feedback  Control 
Basis: 

•  linear  combination  of  the  state  variables 

•  [Kailath (1980); Maybeck (1979)]
Advantages: 

•  well-developed  theory 
•  "turnkey" design process

•  wide-ranging  applicability 
Disadvantages: 

•  linear  and  quasi-linear  systems  only 
•  sensitive to modeling uncertainty

•  not  adaptive 

Gain  Scheduling 
Basis: 

•  multiple  linear  controllers 

•  switching  or  interpolation 

•  [Astrom  &  Wittenmark  (1989)] 

Advantages: 

•  nonlinear  control  law 
Disadvantages: 

•  extensive manual tuning required

•  ad  hoc  design  =>  "black  art" 

•  not  adaptive 

Indirect  Adaptive  Control 

Basis:

•  parameter  estimation  (e.g.,  recursive  least  squares) 

•  linear  design  (e.g.,  pole-placement,  LQR) 

•  [Gupta (1986); Narendra, Ortega, & Dorato (1991)]
Advantages: 

•  adaptive 

•  flexible  design 
Disadvantages:

•  age-weighting  or  resetting  required 

•  persistent  excitation  required 

•  identification  problems 
Model Reference (Direct) Adaptive Control
Basis: 

•  explicit  description  of  desired  system  behavior  (reference) 

•  gradient  algorithm 

•  [Astrom  &  Wittenmark  (1989);  Narendra  &  Annaswamy  (1989)] 
Advantages: 

•  identification  not  required 
Disadvantages: 

•  stability  problems  for  nonminimum  phase  systems 

•  local  optimization  only 

Direct  Adaptive  Control:  TDC 

Basis: 

•  "time-delay"  control:  adaptive  nonlinear  transformation 

•  [Youcef-Toumi  &  Ito  (1990)] 

Advantages:

•  simple  algorithm 

•  exceptional  flexibility 
Disadvantages: 

•  sensitive  to  "B"  matrix  uncertainty  (e.g.,  unknown  actuator  dynamics) 

•  high  sampling  rate  required 

•  state  derivative  required 

ASE/ACE  Learning  Control 
Basis: 

•  associative  reinforcement  learning 

•  [Barto, Sutton, & Anderson (1983)]
Advantages:

•  simple  learning  algorithm 

•  optimizing
Disadvantages: 

•  bang-bang  control  laws  only 

•  state-space  quantization  required 


•  high  storage  requirements 

•  performance  does  not  improve  with  refinement  of  quantization  scheme 

•  difficult  to  specify  control  objective 

None of the various approaches considered above was completely satisfactory for
the objectives we had in mind. In addition to carrying out software simulations of
the basic algorithms, we also conceived and implemented a number of modifications
to these algorithms (e.g., a type of "annealing" algorithm for the ASE/ACE
paradigm). Even so, none appeared to be suitable for our application to flight
control. However, as a result of the experience and insight gained during our
examination of these various approaches as well as the D-R learning paradigm, we
were able to first conceive and then develop a novel hybrid adaptive/learning
control approach that was ultimately successful. This approach is the subject of
most of the remaining chapters.

2.5 D-R Revisited: ACP Networks & Learning for Optimal Control

As  mentioned  at  the  outset,  a  decision  was  made  to  revisit  the  drive-reinforcement 
learning  paradigm  in  the  special  case  where  it  is  embedded  in  an  associative 
control  process  network.  The  results  of  this  phase  of  the  program  are  fully 
described  in  Attachment  3. 
3 Learning for Flight Control


This  chapter  provides  a  high-level  discussion  of  the  motivation  for,  as  well  as  the 
issues  underlying,  the  use  of  learning  in  flight  control  applications. 

After a brief introduction in Section 3.1, some background material on the topic of
flight control system design is presented in Section 3.2. Key differences between
adaptation and learning (in this context) are outlined in Section 3.3. Section 3.4
motivates and presents a high-level description of the hybrid adaptive/learning
control methodology, which will be developed in more detail in the next chapter.
Section 3.5 elaborates on the idea that learning (in the context of control) can be
considered as automatic (on-line) function synthesis. Section 3.6 discusses a
number of important application issues associated with the use of adaptation and
learning in flight control. The potential benefits of learning augmentation are
then summarized in Section 3.7.

3.1  Introduction 

The design of automatic control systems for high performance aircraft represents
a difficult and challenging problem because of the coupled, multivariable,
nonlinear, and time-varying nature of flight dynamics, in conjunction with the
uncertainties associated with existing aerodynamic vehicle models. The added
specification of "high performance" generally implies an expanded flight envelope
and faster dynamics, attributes that only exacerbate the problem. Conventional
control system design methods for such systems have a number of important
limitations. Fixed parameter, off-line control system design approaches (e.g.,
based on gain scheduling, dynamic inversion, or extended linearization) are
suitable for nonlinear problems where there is little or no model uncertainty; in
practice, they often require extensive manual tuning (which can lead to excessive
development costs) because the physical models used during the design process do
not always accurately reflect the actual system dynamics. Robust design methods
deal with the problem of model uncertainty, but may sacrifice closed-loop system
performance as a result, and may be impractical for problems involving significant
nonlinearity or time-varying dynamics. Adaptive control approaches can
accommodate some parametric model uncertainty and slowly time-varying dynamics,
but may be unsuitable or inefficient for problems involving significant
structural model uncertainty (e.g., significant nonlinearity).

An alternative approach relies on the use of learning to augment the flight control
system. This approach can directly accommodate parametric uncertainty and
some structural uncertainty (including memoryless nonlinearities). Learning
systems offer unique capabilities that may be exploited to provide superior flight
control systems. The approach we have pursued relies on special-purpose
connectionist systems that can be used for on-line learning. The key concept
underlying our methodology is the view that

    learning may be interpreted as the automatic synthesis of multivariable
    functional mappings, based upon experiential information that is
    gained incrementally over time, and a criterion for optimality.

When combined with adaptation, the resulting hybrid control strategy provides a
powerful control system design and implementation technique. For flight control
applications, we argue that advanced control systems incorporating learning may
be used advantageously to:

•  facilitate  the  control  system  design  and  tuning  process 

•  accommodate  modeling  error  through  on-line  interaction  with  the  actual 
vehicle 

•  improve  performance  through  on-line  self-optimization 

•  improve efficiency by reducing undesirable transient effects that would
ordinarily be induced by parameter adjustment in a purely adaptive
controller

In a fundamental sense, the flight control system design problem is to find an
appropriate functional mapping, from measured and desired vehicle outputs, to a
set of control actions that will produce satisfactory behavior in the closed-loop
system. In other words, the problem is to choose a function (a control law) that
achieves certain closed-loop performance objectives when applied to the open-loop
system. In turn, the solution to this problem may naturally involve other
mappings (recall Section 1.1). In general, these mappings may represent static or
dynamic functions.
If there is adequate design information (i.e., if all pertinent vehicle dynamic and
aerodynamic models are available) and if there is little or no significant
uncertainty in this data, then (in principle) the mappings required to produce a
satisfactory flight control system can be designed and developed through a
completely off-line process, which results in an a priori controller. Unfortunately,
this situation rarely exists in practice, particularly when the design is for a system
as complex as a modern high performance aircraft. At the very least, manual
tuning of the nominal control law will be required, following initial flight testing.

The  need  for  learning,  in  the  context  of  flight  control,  arises  in  situations  where  a 
system  must  operate  in  conditions  of  uncertainty,  and  where  the  available  a  priori 
information  is  so  limited  that  it  is  impossible  or  impractical  to  design  in  advance 
a  system  that  has  fixed  properties  and  also  performs  sufficiently  well  [Tsypkin 
(1973)]. Current design trends for high performance aircraft (e.g., flight envelope
expansion  into  more  complex  and  less  understood  flight  regimes)  suggest  that  the 
traditional  off-line  design  approach,  followed  by  flight  test  and  manual  tuning, 
will  become  increasingly  difficult,  perhaps  even  to  the  point  where  this  approach 
is  no  longer  adequate,  nor  cost-effective  [Baker  &  Farrell  (1991);  Steinberg  (1992)]. 
Accordingly,  a  central  role  of  learning  in  flight  control  is  to  enable  a  wider  class  of 
problems  to  be  solved,  by  reducing  the  prior  uncertainty  to  the  point  where  sat¬ 
isfactory  solutions  can  be  obtained,  in  part,  on-line. 

3.2 Background: Flight Control System Design

Subsection 3.2.1 briefly reviews some of the difficulties that may be encountered in
the  design  of  flight  control  systems,  while  Subsection  3.2.2  describes  limitations  of 
the  traditional  approaches  that  are  used  to  address  these  problems.  Issues  that 
are  not  related  to  learning  augmentation  per  se  have  been  purposely  excluded 
from  the  discussion. 

3.2.1  Design  Difficulties 

An  effective  flight  control  system  design  must  address  several  difficulties  related 
to  the  complex  dynamical  behavior  of  aircraft,  as  well  as  a  further  difficulty 
arising  from  modeling  errors. 
Multivariable Control & Dynamic Coupling


Because  aircraft  rely  on  multiple  effectors  (e.g.,  ailerons,  canards,  elevators, 
rudders,  and  thrust-vectoring)  to  simultaneously  control  a  number  of  outputs 
(e.g.,  attitude  and  attitude  rates),  the  control  system  design  problem  is  formally  a 
multivariable  one.  Due  to  a  combination  of  rigid-body  and  aerodynamic  effects, 
the  principal  state  variables  associated  with  aircraft  flight  dynamics  (altitude,  ve¬ 
locity  vector,  orientation  angles,  and  angular  velocity  vector)  are  coupled  via  the 
equations-of-motion.  Thus,  for  example,  nonzero  roll  rates  can  cause  yaw  rate 
changes  (i.e.,  roll-yaw  coupling).  In  some  cases,  it  may  be  possible  to  decouple 
such  input/output  modes  and  design  multiple  independent  single-input/single- 
output  (SISO)  controllers.  However,  this  approach  has  the  disadvantage  that  sys¬ 
tem  performance  will  often  be  sacrificed,  since,  in  general,  a  multivariable  con¬ 
trol  system  may  be  necessary  to  fully  exploit  the  dynamical  potential  of  the  vehicle 
and  obtain  maximal  system  performance  (e.g.,  maximal  maneuvering  capabil¬ 
ity).  The  design  of  a  full  multivariable  control  system  is  often  considerably  more 
difficult  than  the  design  of  multiple  independent  SISO  controllers,  due  to  the 
higher  dimensionality  and  dynamic  coupling  associated  with  the  system. 

Nonlinearity 

The dynamical behavior of all aircraft is inherently nonlinear. This is due to: (i)
aerodynamic forces and moments that are complex nonlinear functions of aircraft
state, (ii) dynamic coupling terms that have a nonlinear form, (iii) actuators
and effectors that have physical limitations (e.g., saturation and rate limits), and
(iv) nonlinear engine dynamics. For example, control surface effectiveness depends
on the speed, altitude, and attitude of the aircraft; under some conditions a
control surface may become ineffective (e.g., in a stall) or even reverse its
effectiveness (e.g., as in aileron reversal). Standard aircraft equations-of-motion are
structurally nonlinear; moreover, the "parameters" used are often themselves
nonlinear functions of the aircraft state. Under such conditions, a single fixed
gain linear control design will generally be inadequate.
Time Varying Dynamics


Additional control system design difficulties arise because the dynamical behavior
of an aircraft can change over time. The effect of these variations may be either
predictable or unpredictable. Sources of time-varying dynamics whose effect
cannot be predicted include disturbances (e.g., wind gusts), component degradation,
and component failures. In contrast to these sources of time-varying dynamics,
there are others whose effect can be predicted. For instance, aircraft dynamical
behavior varies (in a predictable manner) as a function of configuration
changes (e.g., wing sweep), fuel use, or payload deployment. These particular
time-varying behaviors can actually be construed as spatial dependencies, where
the state of the vehicle is augmented to include variables (which may be analog or
discrete-valued) such as WING_SWEEP_ANGLE, REMAINING_FUEL, or
PAYLOAD_DEPLOYED. To accommodate such predictable temporal variations, a control
system must be explicitly designed to do so (e.g., via gain scheduling), be adaptive, or
be robust to such effects.

Model  Uncertainty 

Aerodynamic vehicle models are susceptible to two types of uncertainty: structural
and parametric. Structural uncertainty arises when the assumed mathematical
form of the equations-of-motion (e.g., the standard six-degree-of-freedom
aircraft model) is unable to adequately describe the behavior of the vehicle
throughout the operating envelope. This means that no fixed (constant) set of
globally correct model parameters exists. To account for this, the parameters in
most aircraft models are scheduled as a function of other variables. In turn, this
generally implies that the functional relationship between these scheduling
variables and the model parameters is not known in closed-form (a second possibility
is that the relationship is too complex to be represented conveniently in
closed-form). Even if the presumed equations-of-motion were capable of accounting for
all aerodynamic forces and moments (i.e., even if there were no structural
uncertainty), empirical errors incurred during the estimation of the model parameters
(e.g., due to measurement noise) would still result in some parametric uncertainty
and, hence, discrepancies between the actual and simulated vehicle behavior.
In general, model-based control system design can only be as good as the
underlying model, and if there is significant uncertainty in this model, the results
may be catastrophic.

Design  Trends 

As high performance aircraft continue to evolve, the desired flight envelope is
likely to expand; indeed, the general trend is towards new flight regimes that are,
at the same time, more complex and less understood (e.g., post-stall maneuvering).
Trends that are likely to exacerbate the design problem include:

•  flight envelope expansion => increasingly nonlinear and unknown regimes

•  additional control effectors (e.g., vectored thrust) => higher dimensionality

•  relaxed-static-stability and agility => faster control response needed

In  the  future,  flight  control  system  design  may  become  increasingly  difficult, 
perhaps  even  to  the  point  where  traditional  methods  are  no  longer  adequate,  nor 
cost-effective. 

3.2.2 Traditional Design Approaches

Three  different  approaches  to  flight  control  system  design  are  discussed  below,  in 
the  context  of  the  design  difficulties  outlined  in  the  previous  subsection.  In  a  gen¬ 
eral  sense,  the  use  of  learning  does  not  preclude  the  use  of  these  other  techniques; 
in  fact,  the  main  advantages  of  learning  are  realized  by  using  it  (in  an  appropri¬ 
ate  manner)  to  augment  existing  control  methodologies. 

Robust  Control 

Robust control design methods attempt to explicitly incorporate robustness to
parametric and structural model uncertainty into the control system design
[Maciejowski (1989)]. In the ideal case where the design model is perfect and
there is no uncertainty, maximal closed-loop system performance can be obtained
through an appropriate optimal feedback control law. Acknowledging that some
level of model uncertainty exists, robust control techniques (e.g., H∞, μ-synthesis,
or even classical gain and phase margin based approaches) can be used to design
a fixed parameter control system that will provide a guaranteed level of
(sub-maximal) performance for any plant in a prescribed set of likely plants. This set
might, for instance, be described by a simplified model (typically linear) and a
bounded set of nominal model parameters (usually, a physically realistic range is
specified for each model parameter). The parameter ranges must be large
enough to ensure that the resulting set of models contains the behavior of the
actual plant. The robust control system design problem is essentially a minimax
optimization problem; its solution tends to be conservative in the sense that the
best that can be achieved is often dictated by the worst case scenario. Thus, a
tradeoff exists between performance and robustness, and robust control designs
are achieved at the expense of resulting closed-loop system performance, relative
to a control design based on a perfect model.
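
The minimax character of robust design can be illustrated with a minimal sketch.
Everything in it is an assumption chosen for illustration and is not drawn from this
report or from any particular robust design method: the scalar plant
x[t+1] = a*x[t] + b*u[t], the quadratic cost weights, the range assumed for the
uncertain pole a, and the brute-force grid search over a fixed feedback gain.

```python
import numpy as np

def lq_cost(k, a, b=1.0, q=1.0, r=0.1):
    """Infinite-horizon cost sum(q*x^2 + r*u^2) for u = -k*x, starting at x0 = 1."""
    pole = a - b * k                      # closed-loop pole
    if abs(pole) >= 1.0:
        return np.inf                     # unstable loop: unbounded cost
    return (q + r * k * k) / (1.0 - pole * pole)

GAINS = np.linspace(0.0, 2.5, 251)        # candidate fixed gains (hypothetical grid)
PLANT_SET = np.linspace(0.8, 1.4, 13)     # assumed range of the uncertain pole a
A_NOMINAL = 1.1                           # assumed nominal design model

def worst_case_cost(k):
    """Largest cost incurred by gain k over the prescribed set of plants."""
    return max(lq_cost(k, a) for a in PLANT_SET)

# Nominal design: best gain for the nominal plant only.
k_nominal = min(GAINS, key=lambda k: lq_cost(k, A_NOMINAL))
# Minimax design: best gain for the worst plant in the set.
k_robust = min(GAINS, key=worst_case_cost)
```

Comparing the two gains exposes the tradeoff described above: by construction,
k_robust has a worst-case cost no larger than that of k_nominal, while its cost on
the nominal plant is no smaller.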

To some extent, robustness to nonlinear and time-varying dynamical behavior
may also be obtained through robust design methods, since these effects can be
considered to be a component of the model uncertainty, and can therefore be
accommodated by increasing the level of the prescribed model uncertainty. The
disadvantage of this approach is that it can result in very conservative assumptions
about the level of uncertainty and, thus, may lead to even lower levels of overall
system performance. There is probably sufficient uncertainty in the standard
aircraft equations-of-motion so that high closed-loop system performance can only
be obtained in one of two ways: (i) through extensive manual (off-line) tuning of
the nominal control law design, based on flight test data or (ii) via an automatic
on-line adjustment technique; a fixed a priori robust control design will not
necessarily suffice.

Gain  Scheduling  /  Manual  Tuning 

A  traditional  control  system  design  methodology  for  high  performance  aircraft  is 
based  on  gain  scheduling  [Astrom  &  Wittenmark  (1989);  Kreisselmeier  (1986); 
Stein  (1986)].  In  this  scheme,  multiple  linear  controllers  are  used  to  approximate 
the  required  nonlinear  control  law.  A  separate  linear  controller  must  be  designed 
for  each  member  of  an  ad  hoc  set  of  distinct  regions  that  together  cover  the 
complete  flight  envelope.  In  each  region  the  dynamical  behavior  of  the  aircraft  is 
assumed  to  be  linear  (multiple  regions  are  required  since  the  aircraft  dynamics 
cannot  be  accurately  represented  by  a  single  linear  model).  Nonlinear  dynamical 
effects  are  addressed  by  transitioning  between  locally  applicable  linear  con¬ 
trollers.  The  complete  nonlinear  control  law  is  realized  by  interpolating  between 
these  separate  linear  controllers  in  a  preprogrammed  manner,  as  a  function  of 
the current state of the vehicle. The number of distinct locally linear operating
points that might be considered can range into the hundreds.
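
As a minimal sketch of the interpolation step, the fragment below schedules a
state-feedback gain on a single variable. The scheduling variable (dynamic
pressure), the three design points, the gain values, and the two-element state are
all hypothetical; an actual schedule would typically involve several scheduling
variables and many more design points.

```python
import numpy as np

# Hypothetical design points: dynamic pressure values (kPa) at which separate
# linear controllers were designed, and the feedback gains chosen at each.
Q_BAR_POINTS = np.array([5.0, 20.0, 50.0])
GAIN_TABLE = np.array([
    [4.0, 1.2],   # gains [k_alpha, k_q] designed for q_bar = 5
    [2.0, 0.6],   # gains designed for q_bar = 20
    [0.8, 0.3],   # gains designed for q_bar = 50
])

def scheduled_gain(q_bar):
    """Linearly interpolate each gain between neighboring design points.

    np.interp holds the end-point value outside the tabulated range.
    """
    return np.array([np.interp(q_bar, Q_BAR_POINTS, GAIN_TABLE[:, j])
                     for j in range(GAIN_TABLE.shape[1])])

def control(x, q_bar):
    """u = -K(q_bar) x: linear at each flight condition, nonlinear overall."""
    return -scheduled_gain(q_bar) @ x
```

Note that the table is fixed before flight and is read out in an open-loop fashion:
nothing in this sketch corrects the schedule if the actual vehicle differs from the
model used to build it.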

The ad hoc and localized nature of the gain scheduling design approach can result
in numerous design iterations, each involving manual redesign of the nominal
control law (for certain flight conditions), followed by extensive computer
simulation to evaluate the modified control law. After the initial control system
has been designed and validated in simulation, further difficulties may arise
because the models used during the design process will not always accurately
reflect the actual vehicle dynamics. Extensive on-line tuning of the nominal control
system may be required to achieve satisfactory performance. Moreover, since the
feedback gains are scheduled in an open-loop fashion, no automatic corrective
action is taken to mitigate the effects of a control law that is no longer appropriate.
Hence, time-varying dynamics and other unanticipated events (e.g., performance
degradation and changes in the vehicle configuration or environment) are only
indirectly addressed through the limited robustness of the final control system
design. In the future, conventional gain scheduled (and manually tuned) control
system design methods may become increasingly difficult as a direct result of the
growing complexity and sophistication of new high performance aircraft.

Adaptive Control

Many different adaptive control methods for nonlinear and time-varying systems
have been investigated [Astrom & Wittenmark (1989); Gupta (1986); Narendra &
Annaswamy (1989); Narendra & Monopoli (1980); Narendra, Ortega, & Dorato
(1991); Slotine & Li (1991)]. Two generic strategies are briefly described here.
Indirect adaptive control approaches utilize an explicit dynamical model of the
vehicle, which is updated periodically, to synthesize new control laws. This approach
has the advantage that powerful design methods (including optimal control design
techniques) can be employed on-line; however, it has the disadvantage that
on-line model identification is required. Alternatively, direct adaptive control
approaches do not rely upon a vehicle model and, thus, avoid the need to perform
explicit on-line model identification. Instead, the control law is adjusted directly,
based on the observed dynamical behavior of the vehicle. In either case, the
control system will attempt to adapt if the behavior of the vehicle changes by a
significant degree.
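
The indirect strategy can be sketched for a scalar plant using recursive least
squares and pole placement, both of which are listed earlier in this report as
examples of the identification and design steps. The first-order plant
y[t+1] = a*y[t] + b*u[t], its parameter values, the forgetting factor, the probing
signal, and the guard on the estimated input gain are assumptions made for
illustration only.

```python
import numpy as np

class RecursiveLeastSquares:
    """RLS estimator for y = phi . theta with exponential forgetting."""

    def __init__(self, n_params, forgetting=0.98, p0=1e3):
        self.theta = np.zeros(n_params)      # parameter estimate
        self.P = p0 * np.eye(n_params)       # scaled estimate covariance
        self.lam = forgetting                # < 1 discounts old data

    def update(self, phi, y):
        phi = np.asarray(phi, dtype=float)
        err = y - phi @ self.theta           # one-step prediction error
        gain = self.P @ phi / (self.lam + phi @ self.P @ phi)
        self.theta = self.theta + gain * err
        self.P = (self.P - np.outer(gain, phi @ self.P)) / self.lam
        return self.theta

def pole_placement_gain(theta, desired_pole):
    """Gain k such that a_hat - b_hat*k equals desired_pole (with u = -k*y)."""
    a_hat, b_hat = theta
    return (a_hat - desired_pole) / b_hat

def run_indirect_adaptive(a=0.9, b=0.5, desired_pole=0.5, n_steps=200, seed=0):
    """Identify (a, b) on-line and redesign the feedback gain at every step."""
    rng = np.random.default_rng(seed)
    est = RecursiveLeastSquares(n_params=2)
    y = 0.0
    for _ in range(n_steps):
        # Redesign only when the estimated input gain is usable (assumed guard).
        usable = abs(est.theta[1]) > 1e-3
        k = pole_placement_gain(est.theta, desired_pole) if usable else 0.0
        u = -k * y + rng.standard_normal()   # probing term keeps the data exciting
        y_next = a * y + b * u               # "true" plant, unknown to the controller
        est.update([y, u], y_next)
        y = y_next
    return est.theta
```

The probing term reflects the persistent excitation requirement, and the forgetting
factor is one form of age-weighting; both were noted as practical burdens of
indirect adaptive control in the summary of Chapter 2.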


Adaptive control systems are dynamic systems and require finite time intervals to
properly detect and account for variations in the vehicle or its environment. If the
dynamical characteristics of the vehicle vary considerably over its operating
envelope (e.g., due to nonlinearity), then the control system may be adapting most of
the time (i.e., it may always be in a "partially" adapted state), resulting in
degraded performance. Note that this can occur even in the absence of time-varying
dynamics and disturbances, since the control system must readapt every time a
different dynamical regime is encountered (i.e., one that is outside the scope of the
current control law). These issues are particularly relevant to flight control
applications, where vehicle behavior is strongly dependent upon flight condition.
For these and other reasons, most control system designs for high performance
aircraft have been based on gain scheduling, rather than on adaptive methods
[Kreisselmeier (1986); Stein (1986)]. As will be discussed in the next section,
learning systems may be used to simultaneously overcome some of the shortcomings
and complement many of the advantages of adaptive control systems.

3.3 Adaptation vs. Learning

In addition to modeled (i.e., known) nonlinearities or time-varying dynamics, an
effective automatic control system must overcome difficulties arising from two
sources: (i) noise, disturbances, and unmodeled time-varying dynamics and (ii)
unmodeled nonlinearities, dynamic coupling, and other spatial dependencies.
The first has a temporal emphasis and represents dynamical features that are
essentially unpredictable; in contrast, the second source of difficulty has a spatial
emphasis and represents dynamical features that are predictable. For example,
an advanced flight control system for a high performance aircraft could eventually
"learn" to anticipate the nonlinear aspects of the vehicle behavior, but could
never anticipate noise, disturbances, or unmodeled time-varying dynamics.

In much the same way that adaptive approaches can be considered as an extension
of fixed parameter (nonadaptive) control methods, where on-line adjustment
of control system parameters is used to compensate for simple model uncertainty,
learning approaches can be considered as an extension of adaptive control methods,
where on-line synthesis of functional relationships is used to accommodate
more complex model uncertainty. Adaptive methods can be used for applications
involving unknown (but constant or slowly time-varying) model parameters like
vehicle inertial properties, while learning methods can be used for unknown (but
quasi-static) spatial dependencies like control surface effectiveness as a function
of angle-of-attack. Both methods utilize experiential information gained through
closed-loop interactions with the vehicle and environment to improve their
performance during subsequent interactions.

The key differences between adaptation and learning are essentially a matter of
degree and emphasis. Adaptive control has a temporal emphasis: its objective is
to maintain desired closed-loop behavior in the face of disturbances and dynamics
that appear to be time-varying. In actuality, the changing dynamics may be
caused by unmodeled nonlinear effects, so that they are really a function of state
rather than of time. Because the functional form of most adaptive control laws is
generally incapable of representing, over a wide range of operating conditions, the
required control action as a function of the current vehicle state, it can be said that
adaptive controllers lack "memory" in the sense that they must readapt to
compensate for all changing dynamics, even those which are nonlinear (but
time-invariant) and have been experienced previously. This inefficiency can result in
degraded performance, since transient behavior due to parameter adjustment
may occur every time the presumed dynamical behavior of the vehicle changes by
a sufficient degree.

In general, adaptive controllers lack the ability to distinguish between temporally
and spatially dependent variations in the dynamics of a vehicle. They operate, in
effect, by optimizing a small set of adjustable parameters to account for vehicle
behavior that is local in both space and time. To be effective, adaptive controllers
must have relatively fast dynamics so that they can quickly react to changing
vehicle behavior.

Learning augmented control strategies differ from those of conventional adaptive
control primarily with respect to "memory" (in the same sense as used above), use
of past information, and emphasis. The contrast between adaptation and
learning is particularly relevant to the control of high performance aircraft.
Learning augmented controllers exploit an automatic mechanism that associates,
throughout some operating envelope, a suitable control action or set of control
system parameters with the current flight condition. In this way, the presence
and effect of previously unknown nonlinearities can be anticipated and
accounted for, based on past experience. Once such a control system has "learned,"
transient behavior that would otherwise be induced by parameter adaptation no
longer occurs, resulting in greater efficiency and improved performance over
adaptive control strategies. To accomplish this, learning control systems rely on
general function approximation schemes that may be used, for example, to map
the current flight condition to an appropriate set of control system parameters (in
this regard, the net effect of learning is similar to that of gain scheduling, with
the proviso that learning has occurred on-line with the actual vehicle, while gain
scheduling is developed off-line via a model).
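
The sense in which a learning system has "memory" while an adaptive parameter
does not can be illustrated with a minimal sketch. The normalized Gaussian
radial-basis map, the cosine curve standing in for control surface effectiveness
versus angle-of-attack, and all rates and widths below are assumptions chosen for
illustration; they are not the connectionist systems developed in this report.

```python
import numpy as np

class RBFMap:
    """Normalized Gaussian radial-basis map from a scalar condition to a parameter."""

    def __init__(self, centers, width):
        self.centers = np.asarray(centers, dtype=float)
        self.width = width
        self.weights = np.zeros_like(self.centers)

    def _basis(self, x):
        g = np.exp(-0.5 * ((x - self.centers) / self.width) ** 2)
        return g / g.sum()                    # basis values sum to one

    def predict(self, x):
        return self._basis(x) @ self.weights

    def learn(self, x, target, rate=0.5):
        """Incremental (LMS) update; only weights near x change appreciably."""
        phi = self._basis(x)
        self.weights += rate * (target - phi @ self.weights) * phi

def demo(n_round_trips=50):
    truth = np.cos                            # hypothetical effectiveness curve
    grid = np.linspace(0.0, 1.2, 61)          # angle-of-attack sweep (rad)
    learner = RBFMap(np.linspace(0.0, 1.2, 13), width=0.1)
    adaptive = 0.0                            # single adjustable parameter
    for _ in range(n_round_trips):
        # Sweep down to 0 and back up, so every trip ends at alpha = 1.2.
        for alpha in np.concatenate([grid[::-1], grid]):
            target = truth(alpha)
            learner.learn(alpha, target)
            adaptive += 0.5 * (target - adaptive)   # tracks the latest value only
    return learner, adaptive
```

After training, learner.predict(0.0) can be compared with the single adaptive
parameter, which was last adjusted at the far end of the sweep. The map retains
what was experienced at each condition, whereas the adaptive parameter must
readapt whenever the condition changes.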

Learning control has a spatial emphasis. For example, its objective might be to
synthesize a feedback control law (as a function of vehicle state) that provides the
desired closed-loop behavior in the presence of unmodeled nonlinear dynamics.
Alternatively, learning can be used to synthesize a mapping from flight condition
to a set of linear model parameters, which can then be used for on-line control law
design.^1 Learning systems operate by optimizing over a large set of adjustable
parameters (and potentially variable structural elements [Cerrato (1993)]) to
construct a mapping representing the quasi-static spatial dependencies, again
throughout the operating envelope. In effect, this optimization is global in
state-space. To successfully execute this optimization process, learning systems make
extensive use of past information and employ relatively slow learning dynamics.

As defined, the processes of adaptation and learning are complementary: each
has unique desirable characteristics from the point of view of flight control. For
example, adaptive approaches address the problem of slowly time-varying dynamics
and novel situations (e.g., those which have never before been experienced),
but are inefficient for problems involving significant unknown spatial
dependencies. Learning approaches, in contrast, have the opposite characteristic:
they are well-equipped to accommodate nonlinear vehicle dynamics, but are not
well-suited to applications involving time-varying dynamics.


^1 This second approach is the learning analog to indirect adaptive control, whereas the first
approach is analogous to direct adaptive control; cf. [Atkins (1993)].
3.4 Motivation for Hybrid Adaptive/Learning Control


This  section  briefly  describes  hybrid  control  system  architectures  that  exhibit  both 
adaptive  and  learning  behaviors.  These  hybrid  structures  incorporate  adaptation 
and  learning  in  a  synergistic  manner.  In  such  schemes,  an  adaptive  system  is 
coupled  with  a  learning  system  to  provide  real-time  adaptation  to  novel  situations 
and  slowly  time-varying  dynamics,  in  conjunction  with  learning  to  accommodate 
stationary or quasi-stationary state-space dependencies (e.g., memoryless
nonlinearities). The adaptive control system reacts to discrepancies between the desired
and observed behaviors of the plant, to maintain the requisite closed-loop system
performance. These discrepancies may arise from time-varying dynamics,
disturbances, or unmodeled dynamics. In practice, little can be done to anticipate
time-varying dynamics and disturbances; thus, these phenomena are usually
handled through feedback in the adaptive system. In contrast, the effects of some
unmodeled dynamics (in particular, static nonlinearities) can be predicted from
previous experience. This is the task given to the learning system. Initially, all
unmodeled behavior is handled by the adaptive system; eventually, however, the
learning system is able to anticipate previously experienced, yet initially
unmodeled, behavior. Thus, the adaptive system can concentrate on novel situations
(where little or no learning has occurred) and slowly time-varying behavior.

Two general hybrid architectures are outlined in this section. The discussion of
these architectures parallels the usual presentation of direct and indirect adaptive
control strategies. In each approach, the learning system is used to alleviate
the burden on the adaptive controller of continually reacting to predictable
state-space dependencies in the dynamical behavior of the plant (e.g., stationary,
memoryless nonlinearities). Note that various technical issues must be addressed to
guarantee the successful implementation of these approaches. For example, to
ensure both the stability and robustness of the closed-loop system (which includes
both the adaptive and learning systems, as well as the plant), one must address
issues related to: controllability and observability; the effects of noise,
disturbances, model-order errors, and other uncertainties; parameter convergence,
sufficiency of excitation, and nonstationarity; computational requirements, time
delays, and the effects of finite precision arithmetic. Many (if not all) of these
issues arise in the implementation of traditional adaptive control systems; as such,
there are some existing sources one may refer to in the hope of addressing these
issues (e.g., see [Astrom & Wittenmark (1989); Gupta (1986); Narendra &
Annaswamy (1989); Narendra & Monopoli (1980); Narendra, Ortega, & Dorato (1991);
Slotine & Li (1991)]). Although these topics are well beyond the scope of this
report, in some instances the learning augmented approach appears to offer
operational advantages over the corresponding adaptive approach (with respect to such
implementation issues).

3.4.1  Direct  Implementation 

In  the  typical  direct  adaptive  control  approach  (see  Fig.  3.1),  each  control  action  u 
is  generated  based  on  the  measured  and  desired  plant  outputs,  internal 
state  of  the  controller,  and  estimates  of  the  pertinent  control  law  parameters  k. 
The  estimates  of  the  control  law  parameters  are  adjusted,  at  each  time-step,  based 
on  the  error  e  between  the  measured  plant  outputs  and  the  outputs  of  a  reference 
system y_r. Of course, care must be taken to ensure that the plant is actually
capable of attaining the performance specified by the selected reference system.
Direct adaptive control approaches do not rely upon an explicit plant model and,
thus,  avoid  the  need  to  perform  on-line  system  identification. 

The controller in Fig. 3.1 is structured so that normal adaptive operation would
result if the learning system were not implemented. The reference represents the
desired behavior for the augmented plant (controller plus plant), while the adaptive
mechanism is used to transform the reference error directly into a correction
Δk for the current control system parameters. The adaptation algorithm can be
developed and implemented in several different ways (e.g., via gradient or
Lyapunov based techniques; see [Astrom & Wittenmark (1989); Narendra &
Annaswamy (1989); Slotine & Li (1991)]). Learning augmentation can be
accomplished by using the learning system to store the required control system
parameters as a function of the operating condition of the plant [Farrell & Baker (1992);
Vos, Baker, & Millington (1991)]. Alternatively, learning can be used to store the
appropriate control action as a function of the actual and desired plant outputs
[Farrell & Baker (1991)]. The architecture in Fig. 3.1 shows the first case.


Figure 3.1. Direct Adaptive/Learning Approach.

When the learning system is used to store the control system parameters as a
function of the plant operating condition, the adaptive system would provide any
required perturbation to the control parameters k generated by the learning
system. The signal from the control block to the learning system in Fig. 3.1 is the
perturbation in the control parameters δk to be associated with the previous
operating condition. This association (incremental learning) process is used to
combine the estimate from the adaptive system with the control parameters that have
already been learned for that operating condition. At each sampling instant, the
learning system generates an estimate of the control system parameters k
associated with that operating condition, and then passes this estimate to the
controller, where it is combined with the perturbation parameter estimates
maintained by the adaptive system and used to generate the control action u. In the
ideal limit where perfect learning has occurred, and there is an absence of noise,
disturbances, and time-varying dynamics, the correct parameter values would
always be supplied by the learning system, so that both the perturbations δk and
corrections Δk generated by the adaptive system would become zero.^1 Under
more realistic assumptions, there would be some small degradation in
performance due to adaptation (e.g., δk and Δk might not be zero due to noise).
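
The data flow described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the scheme of Fig. 3.1 itself: the plant is a hypothetical static gain b that depends on a discrete operating condition, the control law is u = (k + δk) y_d, the adaptation law is a simple gradient-type update that assumes b > 0, and the learning system is a lookup table. The names `run_direct_hybrid`, `gamma`, `beta`, and `k_init` are invented for this sketch.

```python
def run_direct_hybrid(schedule, plant_gain, y_d=1.0, gamma=0.3, beta=0.2, k_init=1.0):
    """Simulate a toy direct adaptive/learning loop.

    schedule   : sequence of operating-condition labels visited over time
    plant_gain : dict mapping each label to a static plant gain b (assumed positive
                 and unknown to the controller)
    Returns the learned table and the final adaptive perturbation delta-k.
    """
    table = {}   # learning system: operating condition -> learned parameter k
    dk = 0.0     # adaptive system: perturbation delta-k
    for cond in schedule:
        k_learned = table.get(cond, k_init)   # recall (a priori design value if unseen)
        k_total = k_learned + dk              # controller combines k and delta-k
        u = k_total * y_d                     # control action
        y = plant_gain[cond] * u              # static plant response
        e = y_d - y                           # error relative to the desired output
        dk += gamma * e * y_d                 # gradient-type adaptation (assumes b > 0)
        table[cond] = k_learned + beta * dk   # associate part of delta-k with this condition
        dk -= beta * dk                       # total k + delta-k is left unchanged
    return table, dk
```

In this sketch the total parameter k + δk is driven toward 1/b by the adaptive update, while the association step gradually moves that knowledge into the table, so the adaptive perturbation decays toward zero once a condition has been learned.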

In the case where the learning system is trained to store control action directly as
a function of the actual and desired operating conditions of the plant, the adaptive
system would provide any required perturbation to the control action generated by
the learning system. Note that a dynamic mapping would have to be synthesized
by the learning system if a dynamic feedback law were desired (which was not
necessary in the first case). The advantage of this approach over the previous one
is that a more general control law can be learned. The disadvantages are that
additional memory is required and that a more difficult learning problem must be
addressed.

3.4.2  Indirect  Implementation

In the typical indirect adaptive control approach (see Fig. 3.2), each control action
u is generated based on the measured y_m and desired plant outputs, internal
state of the controller, and estimated parameters p of a local plant model. The
parameters k for a local control law are explicitly designed on-line, based on the
observed plant behavior. If the behavior of the plant changes (e.g., due to
nonlinearity), an estimator automatically updates its model of the plant as quickly as
possible, based on the information available from the (generally noisy) output
measurements. The indirect approach has the important advantage that powerful
design methods (including optimal control techniques) may potentially be used
on-line. Note, however, that computational requirements are usually greater for
indirect approaches since both model identification and control law design are
performed on-line.

If the learning system in Fig. 3.2 were not implemented, then this structure
would represent the operation of a traditional indirect adaptive control system.
The estimator output is the adaptive estimate of the plant model parameters. This signal
is used to calculate the control law parameters k. Incorporation of the learning
system would allow the plant model parameters to be learned as a function of the
plant operating condition. The model parameters generated by the learning
system allow previously experienced plant behavior to be anticipated, leading to
improved control law design [Baird & Baker (1990)]. In this case, the output of the
learning system to both the control design block and the estimator is an a priori
estimate of the model parameters associated with the current operating condition.
An a posteriori parameter estimate from the estimator (involving both
filtering and posterior smoothing) is used to update the mapping stored by the
learning system. The system uses model parameter estimates from both the
adaptive and learning systems to execute the control law design and determine
the appropriate control law parameters. In situations where the design
procedure is complex and time-consuming, the control law parameters might also be
stored (via a separate mapping in the learning system) as a function of the plant
operating condition. Thus, control law design could be performed at a lower rate,
assuming that the control parameter mapping maintained by the learning
system was sufficiently accurate to provide reasonable control in lieu of design at a
higher rate.

^1 In this case, the system architecture is similar to that used in gain scheduling, with the proviso
that learning has occurred on-line with the actual plant, while a gain schedule is developed


Figure 3.2. Indirect Adaptive/Learning Approach.

In both of the hybrid implementations described in this section, the learning
system (prior to any on-line interaction) would only contain knowledge derived from
the design model. During initial closed-loop operation, the adaptive system would
be used to accommodate any inadequacies in the a priori design knowledge.
Subsequently, as experience with the actual plant was accumulated, the learning
system would be used to anticipate the appropriate control or model parameters
as a function of the current plant operating condition. The adaptive system would
remain active to handle novel situations and limitations of the learning system
(e.g., finite accuracy). With perfect learning, but no noise, disturbances, or
time-varying behavior in the plant, the contribution from the adaptive system would
eventually become zero. In the presence of noise and disturbances, the
contribution from the adaptive system would become small, but nonzero (depending on the
hybrid scheme used, however, the effect of this contribution might be negligible).
In the general case involving all of these effects, the hybrid control system should
perform better than either subsystem individually. It can be seen that adaptation
and learning are complementary behaviors, and that they can be used
simultaneously (for purposes of automatic control) in a synergistic fashion. These points
will be further brought out in Chapters 4 and 5.

3.5  Learning as Function Synthesis

For a wide and important class of learning control problems, the desired mapping
is known (or assumed) in advance to be continuous. In such situations, memory
implementations with efficient storage mechanisms can be proposed. By
assuming that the desired mapping M*: x → y* is continuous, an approximate
mapping M: x → y can be implemented by any scheme capable of approximating
arbitrary continuous functions. In such cases, the mapping M is represented as
a continuous function, parameterized by a vector p; i.e., M = M(x;p). The
learning update step would be achieved by appropriately adjusting the parameter vector
p by an amount Δp (yet to be determined). By "appropriate," we mean that the
adjusted parameter vector p' = p + Δp is such that the resulting y' = M(x;p') would
be "better" than the original y, relative to the desired response y*. As new
learning experiences became available, the mapping M would be incrementally
improved. Recall would be achieved by evaluating the functional mapping at a
particular point in its input domain.

In  this  parameterized  approach  to  function  synthesis,  the  knowledge  that  is 
gained  over  time  is  stored  in  a  distributed  fashion  in  the  parameter  space  of  the 
mapping.  This  feature,  which  arises  naturally  in  any  practical  implementation 
of  a  continuous  mapping,  can  be  most  desirable  from  a  learning  control  point  of 
view (depending on the way it is achieved, as discussed below). Distributed
learning is advantageous when previous learning under similar circumstances can be
combined  to  provide  a  suitable  response  for  the  current  situation.  This  fusion 
process  effectively  broadens  the  scope  and  influence  of  each  learning  experience 
and  is  referred  to  as  generalization. 

There  are  several  important  ramifications  of  generalization.  First,  it  has  the  ef¬ 
fect  of  eliminating  "blank  spots"  in  the  memory  (i.e.,  specific  points  at  which  no 
learning  has  occurred),  since  some  response  (albeit  not  necessarily  the  desired 
one)  will  always  be  generated.  Second,  it  has  the  effect  of  constraining  the  set  of 
possible input/output mappings that can be achieved by the mapping, since in
most  cases  neighboring  input  situations  will  result  in  similar  outputs.  Finally, 
generalization  complicates  the  learning  process,  since  the  adjustment  of  the 
mapping  following  a  learning  experience  cannot  be  considered  as  an  indepen¬ 
dent,  point-by-point  process  (e.g.,  as  in  BOXES  [Michie  &  Chambers  (1968)]).  In 
spite of this, the advantages accorded by generalization usually far outweigh the
difficulties  it  evokes. 

Generalization is an intrinsic feature of function synthesis approaches that rely
on parameterized continuous mappings. In any practical implementation
having a finite number of adjustable parameters, each adjustable parameter will
affect the realized function over a region of nonzero measure. When a single
parameter p_j (from the set p = {p_1, p_2, ..., p_N}) is adjusted to improve the
approximation at a specific point x, the continuous mapping M (i.e., at least one of the
outputs of M = {M_1, M_2, ..., M_m}) will be affected throughout the region of "influence"
of that parameter. This region of influence is determined by the partial
derivatives ∂M_i/∂p_j (one for each output of M), which are functions of the input x.
Under these conditions, the effect of a learning experience will be generalized
automatically, and extended to all parts of the mapping in which the "sensitivity"
functions ∂M_i/∂p_j are nonzero. The greatest effect will occur where |∂M_i/∂p_j| is
largest; little or no change will occur wherever this quantity is small or zero. The
nature of this generalization may or may not be beneficial to the learning process
depending on whether the extent of the generalization is local or global. These
issues are further discussed in Subsection 3.5.3.
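
The notion of a parameter's region of influence can be illustrated with a small Python sketch. It assumes, purely for illustration, a scalar mapping built from Gaussian basis functions with fixed centers and a common width; the function names and the numerical values are hypothetical and are not taken from this report. For such a mapping the sensitivity of the output to the weight p_j is simply the j-th basis function, so adjusting p_j changes the output mostly near the j-th center.

```python
import math

def basis(x, center, width):
    """Gaussian basis function; also the sensitivity dM/dp_j of the mapping below."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def mapping(x, p, centers, width):
    """M(x; p) = sum_j p_j * basis_j(x), which is linear in the adjustable weights p."""
    return sum(pj * basis(x, cj, width) for pj, cj in zip(p, centers))

def effect_of_adjustment(j, dp, xs, p, centers, width):
    """Change in M at each x in xs when only the single parameter p_j is changed by dp."""
    p_new = list(p)
    p_new[j] += dp
    return [mapping(x, p_new, centers, width) - mapping(x, p, centers, width)
            for x in xs]
```

Calling `effect_of_adjustment(2, 0.1, [2.0, 2.5, 4.0], [0.0] * 5, [0.0, 1.0, 2.0, 3.0, 4.0], 0.5)` returns changes that shrink with distance from the third center, which is the local form of generalization. A basis function with global support would instead spread the same adjustment over the whole input domain.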


For function synthesis approaches based on parameterized representations, the
learning process requires an algorithm that will specify an appropriate Δp so as
to achieve some desired objective. When the mathematical structure used to
implement the mapping is continuously differentiable and the objective function J
can be treated as a "cost" to be minimized, then the construction of Δp can be
straightforward. In the special case where the adjustable parameters p appear
linearly in the gradient vector ∂J/∂p of the cost function J with respect to the
adjustable parameters p, the optimization could be treated as a linear algebra
problem; in general (i.e., for most applications), nonlinear optimization methods must
be used. One nonlinear technique that is suitable for on-line learning is the
gradient learning algorithm: Δp = -W (∂J/∂p)^T, where W is a symmetric positive
definite matrix that determines the "learning rate," and the gradient ∂J/∂p is
defined to be a row vector. If a second-order Taylor expansion is used to provide a
local approximation of the objective function J (about the current parameter
vector p), then the "optimum" W which minimizes this local quadratic cost
function in a single step can be shown to be equal to the inverse of the Hessian matrix
H (of J), so that

\[ \Delta p = -H^{-1} \left( \frac{\partial J}{\partial p} \right)^{T} \qquad (3.1) \]

This equation is only valid when the local Hessian matrix is positive definite.
Because it is difficult to compute and invert the Hessian on-line, the weight matrix
W is usually only an approximation of the inverse Hessian, as in the
Levenberg-Marquardt method [Press, et al. (1988)]. Often, in fact, a single learning rate
coefficient α is used to set W = αI.
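
The difference between the two choices of weight matrix can be checked numerically. The Python sketch below assumes a hypothetical two-parameter quadratic cost J(p) = 0.5 p^T A p - b^T p, whose Hessian is the constant matrix A; the matrix, vector, and function names are illustrative only. With W set to the inverse Hessian a single step reaches the minimum, whereas with W = αI the same step only reduces the cost.

```python
def gradient(p, A, b):
    """Gradient of J(p) = 0.5 * p^T A p - b^T p for a symmetric 2x2 matrix A."""
    return [A[0][0] * p[0] + A[0][1] * p[1] - b[0],
            A[1][0] * p[0] + A[1][1] * p[1] - b[1]]

def cost(p, A, b):
    """Evaluate J(p), using A p = gradient + b to avoid recomputing the product."""
    g = gradient(p, A, b)
    quad = p[0] * (g[0] + b[0]) + p[1] * (g[1] + b[1])
    return 0.5 * quad - (b[0] * p[0] + b[1] * p[1])

def inverse_2x2(H):
    """Inverse of a nonsingular 2x2 matrix."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [[H[1][1] / det, -H[0][1] / det],
            [-H[1][0] / det, H[0][0] / det]]

def gradient_step(p, W, A, b):
    """Return p + dp with dp = -W * (dJ/dp)^T."""
    g = gradient(p, A, b)
    return [p[0] - (W[0][0] * g[0] + W[0][1] * g[1]),
            p[1] - (W[1][0] * g[0] + W[1][1] * g[1])]
```

For example, with A = [[3, 1], [1, 2]], b = [1, 1], and p = [0, 0], the call `gradient_step(p, inverse_2x2(A), A, b)` lands on the point where the gradient vanishes, while `gradient_step(p, [[0.1, 0.0], [0.0, 0.1]], A, b)` moves only part of the way there.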


More insight can be gained into the gradient learning algorithm through an
application of the chain rule, which yields: Δp = -W (∂y/∂p)^T (∂J/∂y)^T (where the
Jacobian ∂y/∂p is defined as a matrix of gradient row vectors ∂y_i/∂p, so that
∂J/∂p = (∂J/∂y)(∂y/∂p)). This form of the gradient learning rule involves two
types of information: the Jacobian of the outputs of the mapping with respect to
the adjustable parameters, and the gradient of the objective function with respect
to the mapping outputs. The gradient ∂J/∂y is determined both by the
specification of the objective function J and the manner in which the mapping outputs
affect this function (which, in turn, is determined by the way in which the learning
system is used within the control system architecture). The Jacobian ∂y/∂p is
completely determined by the approximation structure M and, hence, is known a
priori as a function of the input x. Note that the performance feedback
information provided to the learning system is the output gradient ∂J/∂y. This gradient
vector provides the learning system with considerably more information than the
scalar J; in particular, ∂J/∂y indicates both a direction and magnitude for Δp
(since ∂y/∂p is known), whereas performance feedback based solely on the
scalar J does neither.

To give an illustrative example, a simple quadratic objective function might be
defined as

\[ J = \frac{1}{2} \sum_{x_i \in E} e_i^{T} e_i \qquad (3.2) \]

where J is the cost to be minimized (over a finite set of evaluation points
x_i ∈ E = {x_1, x_2, ..., x_n}) and the output errors e_i = y_i* - y_i = M*(x_i) - M(x_i) are
assumed to be known. In the special case where the objective function is given by
(3.2) and W = αI, the learning rule is

\[ \Delta p = \alpha \sum_{x_i \in E} \left( \frac{\partial y_i}{\partial p} \right)^{T} e_i \qquad (3.3) \]

If  the  objective  function  is  a  strictly  convex  function  of  p,  then  the  gradient  algo¬ 
rithm  will  find  the  optimum  value  p*  that  minimizes  J.  For  most  practical 
learning  control  problems,  however,  the  situation  is  much  more  complicated. 
The  objective  function  J  to  be  minimized  may  involve  terms  that  are  only  known 
implicitly  (e.g.,  the  desired  output  y*  may  not  be  explicitly  known  or,  equivalently, 
the  output  error  e  of  the  mapping  may  not  be  measurable);  moreover,  J  may  be 
significantly  more  complex  than  that  shown  in  (3.2)  (e.g.,  J  may  be  a  dynamic 
rather than a static function). Finally, for reasons that will be discussed in
Subsection 3.5.2, objective functions defined over a finite set of evaluation points (as in
(3.2)) cannot usually be used directly for on-line learning control.
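
For the idealized case in which the output errors are known at every evaluation point, the gradient rule for the quadratic objective (3.2) with W = αI can be sketched as follows. The sketch assumes a hypothetical scalar mapping y = p[0] + p[1] x and an invented target function; it forms the update from the two ingredients named in the text, the Jacobian ∂y/∂p and the output gradient ∂J/∂y = -e.

```python
def model(x, p):
    """Illustrative mapping y = M(x; p) = p[0] + p[1] * x (scalar output)."""
    return p[0] + p[1] * x

def jacobian(x, p):
    """Row vector dy/dp for the mapping above; here it does not depend on p."""
    return [1.0, x]

def batch_cost(p, points, target):
    """J = 0.5 * sum of squared output errors over the evaluation set."""
    return 0.5 * sum((target(x) - model(x, p)) ** 2 for x in points)

def batch_gradient_step(p, points, target, alpha):
    """Apply dp = -alpha * sum_i (dy_i/dp)^T (dJ/dy_i)^T, where dJ/dy_i = -e_i."""
    dp = [0.0] * len(p)
    for x in points:
        e = target(x) - model(x, p)       # output error e_i = y_i* - y_i
        dJ_dy = -e                        # gradient of 0.5 * e^2 w.r.t. the output
        row = jacobian(x, p)
        for j in range(len(p)):
            dp[j] += -alpha * row[j] * dJ_dy
    return [pj + dj for pj, dj in zip(p, dp)]
```

Because this particular cost is a strictly convex function of p, repeated application of `batch_gradient_step` with a sufficiently small `alpha` approaches the minimizing parameter vector; the complications listed in the preceding paragraph are exactly the conditions under which this simple picture breaks down.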




As with all gradient based optimization techniques, there exists a possibility of
converging to a local minimum if the objective function is not convex. This point
together with the preceding discussion suggests two desiderata for learning
control systems employing gradient learning methods: first, the architecture should
allow for the determination (or accurate estimation) of the gradient ∂J/∂y and,
second, the cost function J should be a convex function of the adjustable
parameters p. Note that it may be possible to determine or estimate ∂J/∂y without ever
knowing y*.

3.5.1  Connectionist  Learning  Systems 

Connectionist  systems,  including  what  are  often  called  "artificial  neural  net¬ 
works,"  have  been  suggested  by  many  authors  to  be  ideal  structures  for  the  im¬ 
plementation  of  learning  control  systems.  A  typical  connectionist  system  is  orga¬ 
nized  in  a  network  architecture  that  is  comprised  of  nodes  and  connections  be¬ 
tween  nodes.  Each  node  can  be  thought  of  as  a  simple  processing  unit,  with  a 
number  of  adjustable  parameters  (which  do  not  have  to  appear  linearly  in  the 
nodal input/output relationship). Typically, the number of different node types in
a network is small compared to the total number of nodes. Common examples
include multilayer sigmoidal [Rumelhart, Hinton, & Williams (1990)] and radial
basis function [Poggio & Girosi (1990)] networks.^1 The popularity of such systems
arises, in part, because they are relatively simple in form, are amenable to
gradient learning methods, and can be implemented in parallel computational
hardware.

Perhaps more importantly, however, it is well known that several classes of
connectionist systems have the universal approximation property. This property
implies that any continuous function can be approximated to a given degree of
accuracy by a sufficiently large network [Funahashi (1989); Hornik, Stinchcombe, &
White (1989)]. Although the universal approximation property is important, it is
held by so many different approximation structures that it does not form a
suitable basis upon which to distinguish them. Thus, we must ask what other
attributes are important in the context of learning control. In particular, we must
look beyond the initial biological motivations for connectionist systems and
determine whether they indeed hold any advantage over more traditional
approximation schemes. An important factor to consider is the environment in which
learning will occur. Thus, for example, the quantity, quality, and content of the
information that is likely to be available to the learning system during its
operation critically impact its performance, and should be accounted for in the selection
of a suitable learning approach.

^1 We do not consider any recurrent networks (i.e., networks having internal feedback and,
hence, internal state) in this discussion for the simple reason that any recurrent network
representing a continuous or discrete-time dynamic mapping can be expressed as an
equivalent dynamical system comprised of two static mappings separated by either an
integration or unit delay operator. In other words, the problem can always be decomposed into
two component problems: that of estimating the parameters of the static mappings and that of
estimating the state of the dynamical system (e.g., via an extended Kalman filter [Livstone,
Farrell, & Baker (1992)]).

The particular scenarios that we will consider involve the use of passive learning
strategies;  that  is,  learning  schemes  that  are  opportunistic  and  exploit  whatever 
information  happens  to  be  available  during  the  normal  course  of  operation  of  the 
closed-loop  system.  In  contrast,  one  might  also  consider  active  learning  strate¬ 
gies,  in  which  the  learning  control  system  not  only  attempts  to  drive  the  outputs  of 
the  plant  along  a  desired  trajectory,  but  also  explicitly  seeks  to  improve  the  accu¬ 
racy  of  the  mapping  maintained  by  the  learning  system.  This  is  achieved  by  in¬ 
troducing  "probing"  signals  that  direct  the  plant  into  regions  of  its  state-space 
where  insufficient  learning  has  occurred.  Active  learning  control  is  analogous  to 
dual (adaptive) control [Astrom & Wittenmark (1989)]. Because we wish to focus
on  passive  learning  strategies,  the  learning  systems  we  consider  must  be  capable 
of  accommodating  on-line  measurements  and  performance  feedback  that  arise 
during the normal operation of the closed-loop system. This situation presents
special  challenges,  as  discussed  in  the  next  subsection. 

3.5.2  Incremental  Learning  Issues 

If the goal is to have learning occur on-line, in conjunction with a plant that can
be nominally modeled as the discrete-time dynamical system

\[ x_{k+1} = f(x_k, u_k), \qquad y_k = h(x_k, u_k) \]

where f(·,·) and h(·,·) are continuous, then an objective function of the form given
by (3.2) cannot be used directly. The main problem is that the set of possible inputs
to the mapping maintained by the learning system will not consist of a finite set of
discrete points. Consequently, there will be no easy way to select a finite set of
representative evaluation points z_i ∈ E, nor will it be possible to guarantee that
any or all of them are ever visited. In general, the inputs z to the learning system
will be composed of measured or estimated values of {x, u, y}, which
represent a continuum. Fortunately, various alternative objective functions that
approximate (3.2) are feasible and are often used in practice. For example, one
approach would be to allow the set E to grow on-line to include all z_i as they are
encountered; i.e.,

\[ E_k = \{ z_1, z_2, \ldots, z_k \} \qquad (3.4) \]

In the special case where the adjustable parameters p appear linearly in the
gradient ∂J/∂p of (3.2) and E is given by (3.4), recursive linear estimation
techniques (e.g., RLS) could be used to obtain the "optimum" parameter vector p*
(corresponding to the particular set E). In most connectionist networks,
however, some or all of the adjustable parameters appear nonlinearly in ∂J/∂p;
hence, linear optimization methods cannot be used. Moreover, evaluation sets of
the form given by (3.4) are difficult to employ in a nonlinear setting.
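
As a concrete instance of the linear special case, the sketch below applies a textbook recursive least squares (RLS) update to a mapping that is linear in its parameters, y = p^T φ(z), processing one sample at a time as the evaluation set grows. The regressor φ = [1, x], the large initial covariance, and the function name are assumptions made for this illustration; the report does not specify a particular RLS formulation.

```python
def rls_update(p, P, phi, y):
    """One recursive least squares update for the linear model y = p^T phi.

    p   : current parameter estimate (list of floats)
    P   : current covariance-like matrix (list of lists, assumed symmetric)
    phi : regressor vector for the new sample
    y   : desired output for the new sample
    """
    n = len(p)
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    err = y - sum(phi[i] * p[i] for i in range(n))      # a priori output error
    p_new = [p[i] + gain[i] * err for i in range(n)]
    # P - gain * phi^T P; phi^T P equals (P phi)^T because P is symmetric
    P_new = [[P[i][j] - gain[i] * P_phi[j] for j in range(n)] for i in range(n)]
    return p_new, P_new
```

Starting from p = [0, 0] and P = 10^6 I, feeding noise-free samples of a straight line through `rls_update` recovers its intercept and slope to within the small bias introduced by the finite initial P. No comparably simple recursion is available once the parameters enter nonlinearly, which is the situation the text goes on to describe.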

By far, the most common objective function used for on-line learning in control
applications is the point-wise function given by

\[ J = \frac{1}{2} e^{T} e \qquad (3.5) \]

Equation (3.5) can be considered as a special case of (3.2) when the evaluation set E
contains a single point at each sampling instant. Learning algorithms that seek to
minimize point-wise objective functions in lieu of objective functions defined over
a continuum are referred to as incremental learning algorithms; they are related
to a broad class of stochastic approximation methods [Gelb (1974)]. Incremental
gradient learning algorithms operate by approximating the actual gradient
∂J/∂p of (3.2) with an instantaneous estimate of the gradient, based on (3.5).
Incremental gradient learning algorithms of this form are related to stochastic
gradient methods [Haykin (1991)]. The use of point-wise objective functions to
approximate batch (or ensemble) objective functions (i.e., those in which E contains
more than one point) will generally not be successful unless special attention is
given to the distribution of the evaluation points, the form of the learning
algorithm, and the structure of the network. We will have more to say concerning
this point in the next subsection.


One  well-known  and  widely  used  stochastic  gradient  algorithm  is  the  least-mean- 
square  (LMS)  algorithm  [Widrow  &  HofF (I960)].  The  LMS  parameter  adjustment 
law  is  Ap  -  -a{dJ ,  where  the  gradient  dJ/d^  is  based  on  (3.5).  Given  cer¬ 
tain  assumptions  (e.g.,  linearity,  stationarity,  Gaussian-distributed  random 
variables,  etc.),  LMS  can  be  shown  to  be  convergent,  relative  to  the  objective  func¬ 
tion  of  (3.2),  with  E  given  by  (3.4).  In  this  case,  the  LMS  algorithm  is  guaranteed 
to  be  convergent  in  the  mean  and  mean-square^  i.e., 

$$\lim_{k \to \infty} E(p_k) = p_{opt} \quad \text{and} \quad \lim_{k \to \infty} E(J_k) = J_{\infty}$$

where E( ) denotes expected value, if the learning rate coefficient $\alpha$ (a constant)
satisfies conditions related to the eigenvalues of the input correlation matrix
(e.g., $\alpha$ cannot be too large) [Haykin (1991)]. In the first limit, as the number of
learning experiences goes to infinity, the expected value of the parameter vector
approaches that of the optimum parameter vector $p_{opt}$, corresponding to the
Wiener solution for this problem (which achieves the minimum cost $J_{min}$). In the
second limit, the expected value of the cost (which is the mean-square error) also
approaches a limit, $J_{\infty}$, but not the minimum value achieved by the optimum
(Wiener) solution. Under these same conditions, convergence of the parameter
vector (not its expected value) to the optimum value, i.e.,

$$\lim_{k \to \infty} p_k = p_{opt},$$

can be obtained if the learning rate coefficient decreases at a special rate over time
(e.g., $\alpha_k = 1/k$) [Gelb (1974)]. Although the theory supporting the stability and
convergence of the LMS algorithm only applies to the special case of a linear
network (among other assumptions), the basic strategy underlying LMS has been
used to formulate a simple learning algorithm for nonlinear networks. In this
case, the parameter adjustment law becomes

$$\Delta p = \alpha \left( \frac{\partial y}{\partial p} \right)^{T} e \qquad (3.6)$$

where $\partial J / \partial p$ is based on (3.5) (with $e = y^* - y$). Equation (3.6) represents the standard
incremental gradient algorithm used by many practitioners for on-line learning
control.
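
To make the incremental update concrete, the following is a minimal sketch of the LMS rule for the special case of a linear, scalar-output mapping y = p . z with point-wise cost J = e^2 / 2. The function names (`lms_step`, `run_lms`), the Gaussian data generator, the learning rate, and the noise level are all assumptions chosen for illustration; they are not taken from this report.

```python
import random

def lms_step(p, z, y_desired, alpha):
    """One incremental gradient (LMS) update for the linear map y = p . z."""
    y = sum(pi * zi for pi, zi in zip(p, z))
    e = y_desired - y  # error: desired output minus actual output
    # For J = 0.5 * e**2, dJ/dp = -e * z, so delta_p = -alpha * dJ/dp = alpha * e * z
    return [pi + alpha * e * zi for pi, zi in zip(p, z)]

def run_lms(true_p, alpha=0.05, steps=5000, noise_std=0.1, seed=0):
    """Train on randomly drawn points (hypothetical data source)."""
    rng = random.Random(seed)
    p = [0.0] * len(true_p)
    for _ in range(steps):
        z = [rng.gauss(0.0, 1.0) for _ in true_p]
        y_desired = sum(t * zi for t, zi in zip(true_p, z)) + rng.gauss(0.0, noise_std)
        p = lms_step(p, z, y_desired, alpha)
    return p

true_p = [2.0, -1.0, 0.5]   # assumed "optimum" parameter vector
p_final = run_lms(true_p)
print(p_final)
```

With a constant learning rate the estimate keeps fluctuating about the optimum rather than settling on it, which is the behavior described above for convergence in the mean; a decreasing rate would be needed to remove that residual fluctuation.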


Special  constraints  are  placed  on  a  learning  system  whenever  learning  is  to  occur 
on-line,  during  closed-loop  operation;  these  constraints  can  impact  the  network 
architecture,  learning  algorithm,  and  training  process.  Assuming  a  passive 
learning  system  is  being  employed,  the  learning  experiences  (training  examples) 
cannot  be  selected  freely,  since  the  plant  state  (and  outputs)  are  constrained  by  the 
system  dynamics,  and  the  desired  plant  outputs  are  constrained  by  the  specifica¬ 
tions  of  the  control  problem  (without  regard  to  learning).  Under  these  conditions, 
the  system  state  may  remain  in  small  regions  of  its  state-space  for  extended  peri¬ 
ods  of  time  (e.g.,  near  setpoints).  In  turn,  this  implies  that  the  data  z  used  for  in¬ 
cremental  learning  will  remain  in  small  regions  of  the  input  domain  of  the  map¬ 
ping  being  synthesized.  Such  stasis  can  cause  undesirable  side-effects  in  situa¬ 
tions  where  parameter  adjustments  (based  on  incremental  learning  algorithms) 
have  a  nonlocal  effect  on  the  mapping  maintained  by  the  learning  system. 

For example, if a parameter that has a nonlocal effect on the mapping is repeatedly
adjusted to correct the mapping in a particular region of the input domain,
this may cause the mapping in other regions to deteriorate and, thus, can
effectively "erase" learning that has previously taken place. Such undesirable
behavior arises because the parameter adjustments dictated by an incremental
learning algorithm are made on the basis of a single evaluation point, without regard
to the remainder of the mapping. Another unfortunate phenomenon is inherent
in all incremental learning algorithms: conflicting demands on the adjustable
parameters are created because, for instance, the vector $p^*$ that minimizes J in
(3.5) at some point $z_i$ will generally differ from the vector $p^*$ that minimizes this
function at some other point $z_j$. The idiosyncrasies associated with passive
incremental learning in closed-loop control (i.e., stasis coupled with nonlocal
learning, and conflicting parameter updates) have precipitated the development and
analysis of spatially localized learning systems.

The basic idea underlying spatially localized learning arises from the observation
that learning is facilitated in situations where a clear association can be made
between a subset of the adjustable elements of the learning system and a localized
region of the input-space. Further consideration of this point, in the context of the
difficulties described above, suggests several desired traits for learning systems
that rely on incremental gradient learning algorithms. These objectives can be
expressed in terms of the previously mentioned "sensitivity" functions $\partial M_i / \partial p_j$,
which are the partial derivatives of the mapping outputs $M_i$ with respect to the
adjustable parameters $p_j$. At each point x in the input domain of the mapping, it
is desired that the following properties hold:

•  for each $M_i$, there exists at least one $p_j$ such that the function $|\partial M_i / \partial p_j|$ is
relatively large in the vicinity of x (coverage)

•  for all $M_i$ and $p_j$, if the function $|\partial M_i / \partial p_j|$ is relatively large in the vicinity of
x, then it should be relatively small elsewhere (localization)

Under  these  conditions,  incremental  gradient  learning  is  supported  throughout 
the  input  domain  of  the  mapping,  but  its  effects  are  limited  to  the  local  region  in 
the  vicinity  of  each  learning  point.  Thus,  experience  and  consequent  learning  in 
one  part  of  the  input  domain  have  only  a  marginal  effect  on  the  knowledge  that 
has  already  been  accrued  in  other  parts  of  the  mapping.  For  similar  reasons, 
problems  due  to  conflicting  demands  on  the  adjustable  parameters  are  also  re¬ 
duced. 

Several  existing  learning  system  designs,  including  BOXES  [Michie  &  Chambers 
(1968)],  CMAC  [Albus  (1975)],  radial  basis  function  networks  [Poggio  &  Girosi 
(1990)], and local basis/influence function networks [Baker & Farrell (1990);
Millington  (1991)],  generally  do  exhibit  the  spatially  localized  learning  property. 
In  contrast,  the  ubiquitous  sigmoidal  (or  perceptron)  network  often  does  not  ex¬ 
hibit  this  property.  To  combat  the  problems  associated  with  nonlocalized  learning 
and  conflicting  parameter  updates,  a  number  of  simple  corrective  procedures 
have  been  used  with  sigmoidal  networks,  including  local  batch  learning,  very 
slow  learning  rates,  distributed  (uncorrelated)  input  sequences,  and  randomizing 
input  buffers  (e.g.,  see  [Baird  &  Baker  (1990)]). 

To give a simple example of spatially localized learning, we will briefly describe
local basis/influence function networks and, in particular, the linear-Gaussian
network. This approach relies on a combination of local basis and influence
function nodal units to achieve a compromise between the spatially localized learning
properties of quantized learning systems (e.g., those based on "bins") and the
efficient representation and generalization capabilities of other connectionist
networks. The complete network mapping is constructed from a set of local basis
functions $f_i(x)$ that have applicability only over spatially localized regions of the
input-space. The influence functions $\gamma_i(x)$ are coupled in a one-to-one fashion
with the basis functions $f_i(x)$, and are used to describe the domain over the
input-space (the "sphere of influence") of each local basis function. In other words,
relative to some point $x_i^0$ in the input domain, each influence function $\gamma_i(x)$ is
defined as a nonnegative function, with a maximum at $x_i^0$, that tends to zero for all
points x that are "far away" from $x_i^0$. The overall input/output relationship is
given by

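$$y(x) = \sum_{i=1}^{n} \Gamma_i(x) \, f_i(x) \qquad (3.7a)$$

where n is the number of nodes in the network and $\Gamma_i(x)$ are the normalized
influence functions, defined to be

$$\Gamma_i(x) = \frac{\gamma_i(x)}{\sum_{j=1}^{n} \gamma_j(x)} \quad \text{with} \quad 0 < \Gamma_i(x) < 1 \quad \text{and} \quad \sum_{i=1}^{n} \Gamma_i(x) = 1 \qquad (3.7b)$$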

By design, each adjustable parameter in this network affects the overall mapping
only over the limited region of its input-space described by the associated
(normalized) influence function. Thus, the aforementioned stasis problem is minimized.
Note also that (local) generalization is an inherent property of the network, and
that standard incremental gradient learning methods can still be used.

To further illustrate the basic concept, we will consider a specific realization
employing linear functions (with an offset, so that they are really affine) as the local
basis units, and Gaussian functions as the influence function units. In this
linear-Gaussian network, the functions $f_i(x)$ and $\gamma_i(x)$ are defined to be:

$$f_i(x) = M_i (x - x_i^0) + b_i$$
$$\gamma_i(x) = c_i \exp\{ -(x - x_i^0)^T Q_i (x - x_i^0) \}$$

where, for each nodal pair i in the network, the matrices $M_i$ and $Q_i$, the vectors
$x_i^0$ and $b_i$, and the scalar $c_i$ are all potentially adjustable ($Q_i$ must be symmetric
positive definite). The vector $x_i^0$ represents the local origin shared by the
linear-Gaussian pair, the idea being that the overall mapping is approximated by $f_i(x)$ in
the "vicinity" of $x_i^0$ (as characterized by $\Gamma_i(x)$, relative to all other $\Gamma_j(x)$).
Because of its unique structure, physical meaning is more easily attributed to each
parameter and to the overall structure of the network. As a result, a priori
knowledge and partial solutions are easily incorporated (e.g., linear control point
designs corresponding to the $f_i(x)$). In fact, linear functions were chosen as the
local basis units due to their simplicity and compatibility with conventional gain
scheduled mappings (alternative local basis units may be more desirable if
certain a priori knowledge is available about the regional functional structure of the
desired mapping). Due to its special structure, this network also allows on-line
variable structure learning schemes to be used, where nodal unit pairs can be
added or removed from the network to achieve more accurate or more efficient
mappings. An example of a simple linear-Gaussian network comprised of 5
pairs of local basis/influence function units is shown in Fig. 3.3; the influence
functions (lower part of the figure) have been separated from each other somewhat
so that each of the local linear functions is clearly visible in the overall
input/output mapping (upper part of the figure).


Figure 3.3. An Example of a Linear-Gaussian Network Mapping ($\Re \to \Re$),
Together with Its Underlying Influence Functions.
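
The linear-Gaussian mapping described above can be sketched for the scalar-input, scalar-output case as follows. This is an illustrative sketch only: the class and function names, the three example nodes, and the choice to adapt only the offsets are assumptions made here, not the implementation used in this work. The offset update uses the fact that the sensitivity of the output to each offset equals the normalized influence of that node.

```python
import math

class LinearGaussianNode:
    """One local basis/influence pair for a scalar input and a scalar output."""
    def __init__(self, x0, slope, offset, c=1.0, q=1.0):
        self.x0 = x0          # local origin
        self.slope = slope    # linear gain (a scalar in this sketch)
        self.offset = offset  # affine offset
        self.c = c            # influence scale, must be positive
        self.q = q            # Gaussian width parameter, must be positive

    def basis(self, x):
        return self.slope * (x - self.x0) + self.offset

    def influence(self, x):
        d = x - self.x0
        return self.c * math.exp(-self.q * d * d)

def normalized_influences(nodes, x):
    """Influences divided by their sum; assumes x lies near at least one node."""
    gammas = [n.influence(x) for n in nodes]
    total = sum(gammas)
    return [g / total for g in gammas]

def network_output(nodes, x):
    """Influence-weighted blend of the local affine functions."""
    return sum(G * n.basis(x) for G, n in zip(normalized_influences(nodes, x), nodes))

def update_offsets(nodes, x, y_desired, alpha):
    """Incremental gradient step on the offsets only (dy/d offset_i = Gamma_i)."""
    e = y_desired - network_output(nodes, x)
    for G, n in zip(normalized_influences(nodes, x), nodes):
        n.offset += alpha * G * e

def make_example_nodes():
    # Three hypothetical nodes with local origins at -2, 0, and 2
    return [LinearGaussianNode(-2.0, 1.0, 0.0, q=4.0),
            LinearGaussianNode(0.0, -0.5, 1.0, q=4.0),
            LinearGaussianNode(2.0, 2.0, 3.0, q=4.0)]

nodes = make_example_nodes()
y_far_before = network_output(nodes, 2.0)
for _ in range(50):
    update_offsets(nodes, 0.0, 5.0, alpha=0.5)  # repeated learning at x = 0 only
y_near_after = network_output(nodes, 0.0)
y_far_after = network_output(nodes, 2.0)
```

In this sketch, repeated training at a single point drives the output there to the desired value while leaving the mapping near the distant node essentially unchanged, which is the localization property at work.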

Learning algorithms for spatially localized networks can capitalize on
localization in two ways. First, spatial localization implies that at each instant of time
only a small subset of the nodal units (and hence a small subset of the adjustable
parameters) have a significant effect on the network mapping. Thus, the
efficiency of both calculating the network outputs and of updating the network
parameters can be improved by ignoring all "insignificant" nodal units. For
example, this can be realized in a linear-Gaussian network by utilizing only those
nodal unit pairs with the largest normalized influences; that is, those whose
combined (normalized) influence equals or exceeds some predefined threshold
(e.g., 0.95). This approach can greatly increase the throughput of a network when
implemented in sequential computational hardware. Second, since the
system state may remain in particular regions of its state-space for extended
periods of time, it is expected that the approximation error will not tend uniformly
to zero. Instead, the error will be lowest in those areas where the greatest amount
of learning has occurred. This leads to conflicting constraints on the learning
rate: it should be small, to filter the effects of noise, in those regions where the
approximation error is small; at the same time, it should be larger, for fast
learning, in those regions where the approximation error is large (relative to the
ambient noise level). Resolution of this conflict is possible through the use of
spatially localized learning rates, where individual learning rate coefficients are
maintained for each (spatially localized) adjustable parameter and updated in
response to the local learning conditions. In this case, the elements of the
weighting matrix W would vary individually over time.
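
The thresholding idea mentioned above (retain only the nodal pairs whose combined normalized influence reaches a preset level) can be sketched as follows. The function name `significant_nodes` and the example influence values are hypothetical and serve only to illustrate the selection step.

```python
def significant_nodes(normalized_influences, threshold=0.95):
    """Return indices of the fewest nodes, taken in order of decreasing
    normalized influence, whose combined influence meets the threshold."""
    order = sorted(range(len(normalized_influences)),
                   key=lambda i: normalized_influences[i], reverse=True)
    selected, total = [], 0.0
    for i in order:
        selected.append(i)
        total += normalized_influences[i]
        if total >= threshold:
            break
    return selected

# Hypothetical normalized influences of five nodes at the current input
gammas = [0.02, 0.60, 0.36, 0.015, 0.005]
active = significant_nodes(gammas)
print(active)
```

Only the selected nodes would then be evaluated and updated at this input; the remaining nodes are treated as insignificant for this step.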

The  computational  memory  requirements  for  spatially  localized  networks  fall 
somewhere  between  those  for  nonlocal  connectionist  networks  (on  the  low  side) 
and  those  for  discrete-input,  analog-output  mapping  architectures  (on  the  high 
side).  By  requiring  each  parameter  to  have  only  a  localized  effect  on  the  overall 
mapping,  we  should  expect  an  increase  in  the  number  of  parameters  required  to 
obtain  a  mapping  comparable  in  accuracy  to  a  (potentially  more  efficient)  non-lo¬ 
cal  technique.  Nevertheless,  for  automatic  control  applications,  training  speed 
and  approximation  accuracy  should  have  priority  over  memory  requirements, 
since memory is generally inexpensive relative to the cost of inaccurate or
inappropriate control actions.

3.6  Application  Issues 

One strategy for using learning in flight control involves the development of an
off-line flight control system design that is optimized (via adaptive and learning
augmentation) relative to a design model, followed by in-flight evaluation and
subsequent on-line tuning relative to the actual vehicle. This requires that a
number of key design issues be resolved. Such issues are identified and briefly
discussed below, under the assumption that a hybrid adaptive/learning control
approach will be used.

Performance  Requirements 

As  part  of  any  control  design,  the  desired  transient  and  steady-state  dynamical 
characteristics  of  the  vehicle  must  be  specified.  For  a  complex  system  with  signif¬ 
icant  nonlinearity  and  modeling  uncertainty,  the  specification  of  performance  re¬ 
quirements  that  are  achievable  throughout  the  flight  envelope  may  be  a  non-triv- 
ial  matter.  It  will  almost  certainly  be  the  case  that  the  vehicle  is  fundamentally 
capable  of  "delivering  more"  in  some  flight  conditions  than  in  others.  Unfortu¬ 
nately,  determining  what  these  innate  advantages  are  and  how  they  can  be  ex¬ 
ploited  may  not  be  completely  evident  until  after  in-flight  testing  with  the  actual 
vehicle  has  begun.  On-line  learning  could  be  used  to  optimize  the  closed-loop  per¬ 
formance  of  the  vehicle  over  the  entire  envelope. 

A  Priori  Control  Law  Design 

A  basic  control  law  for  the  flight  control  system  must  be  selected  that  is  expected 
to  be  able  to  achieve  the  desired  performance  requirements.  This  involves  the  se¬ 
lection  of  the  measurements  to  be  used  for  feedback,  the  type  of  control  law  struc¬ 
ture  to  be  used  (e.g.,  a  simple  linear  combination  of  the  feedback  variables  or  dy¬ 
namic  compensation),  and  a  nominal  set  of  control  law  parameters  (e.g.,  feedback 
gains).  If  a  nominal  gain  scheduled  controller  already  exists,  then  this  design 
could  be  incorporated  as  a  priori  knowledge  in  a  suitable  learning  system  [Baker 
&  Farrell  (1992)].  Importantly,  the  a  priori  control  law  structure  should  be  flexible 
enough so that both adaptive and learning augmentation can be used. Finally, an
estimator/observer  may  be  needed  to  synthesize  and/or  filter  the  state  estimates. 

Adaptive  System 

An adaptive control system must be selected that is an extension of the a priori
control law design; that is, the adaptive design should be such that, if there were
no  modeling  error,  the  adaptive  system  would  contribute  nothing.  Either  a  direct 
or  indirect  adaptive  control  system  might  be  used.  The  ability  of  the  hybrid  system 


to  be  robust  to  transients  caused  by  uncertainty  is  determined  primarily  by  the  a 
priori  and  adaptive  components  of  the  control  system. 

Posterior  Estimator 

Depending on whether a direct or indirect adaptive control system has been
selected, a current estimator will exist for the adjustable control or model
parameters, respectively. Since the learning process itself need not be performed in
phase with, or even at the same pace as, the control update cycle, and since
significantly improved estimates can be achieved by using delayed estimation
methods (e.g., combinations of both filtering and smoothing), a posterior estimator
must be developed and used to provide relatively high quality information to the
learning system.

Learning  System 

The detailed structure and parameter update algorithm(s) of the learning system
must be selected. These choices are mediated by the anticipated characteristics
(e.g., number of inputs, number of outputs, and complexity) of the functional
mapping(s) to be learned. If, for instance, a conventional indirect adaptive
controller is used that explicitly estimates model parameters on-line, and if it can be
safely assumed that the significant spatial dependencies in the dynamical
behavior of the vehicle are functions of a subset of the vehicle state (e.g., altitude, speed,
and angle-of-attack), then a mapping from these scheduling variables to the
model parameters is required. Any a priori knowledge of the functional
relationships between the inputs and outputs of the desired mapping could be exploited by
selecting an appropriate initial structure and parameter set.

Training  Procedure 

The development of an appropriate (off-line) training procedure is an important
subtask. The trajectories and initial conditions used for training purposes must
provide sufficient excitation of the vehicle dynamical characteristics and must
adequately explore the specified vehicle operating envelope. An additional issue
concerns the use of a stochastic vehicle model (to approximate a range of
dynamical behaviors) during training. It is possible that such stochastic behavior will
cause the off-line learning process to be less sensitive to slight modeling errors, at
the expense, however, of total training time and initial closed-loop system
performance.

Evaluation  Procedure 

In normal practice, a design model is used to support the flight control system
design, development, and preflight evaluation, while the actual vehicle is used to
evaluate the in-flight performance of the closed-loop system. Thus, an off-line
evaluation of the robust capability of any approach can only be made relative to a
simulation that accounts for the expected differences (due to uncertainty) between
the "design" model and the "actual" vehicle. To fully demonstrate the potential
benefits of a hybrid approach, in particular the on-line tuning and performance
enhancement capabilities, an off-line evaluation procedure would have to allow
sufficient time for "in-flight" interaction with the actual vehicle.

3.7  Expected  Benefits 

Potential benefits that may be accorded by learning augmentation are described
below.

Control System Design and Tuning

Learning augmentation may facilitate the design and tuning of flight control
systems for high performance aircraft in several ways. Learning systems can
provide design automation; they can allow known nonlinearities and spatial
dependencies to be compensated for directly and, in addition, they can provide a means
for off-line flight control system design optimization. These benefits may result in
less manual involvement during the initial design phase (to achieve a specified
level of performance); in addition, a smaller number of test flights (and associated
tuning) may be required for similar reasons. The net effect is a reduction in effort
and, consequently, a reduction in cost. Learning could also play a corresponding
role in the retrofitting of advanced flight control systems into existing high
performance aircraft.
On-Line Accommodation of Uncertainty

Learning augmentation can provide an on-line approach for accommodating both
parametric and structural model uncertainty. This is in contrast to the standard
off-line design approach using fixed-parameter, robust control designs. With
on-line learning, the level of uncertainty may be reduced through direct, closed-loop
interactions with the vehicle and its environment to achieve a posteriori levels of
uncertainty that are substantially lower than a priori ones. With the off-line
approach, closed-loop system performance is generally sacrificed to ensure that the
vehicle satisfies some minimum level of performance requirements for all likely
variations of the model parameters. The tradeoff between performance and
robustness may be especially severe for vehicles that are required to operate with a
variety of payloads or configurations. The ability to learn on-line can reduce the a
priori robustness required, allowing the designer to obtain a higher level of overall
closed-loop system performance. Depending on the type of learning system used,
it may be possible to initialize the system with an a priori robust control design,
and allow on-line learning to improve this nominal design as the level of
uncertainty is reduced through direct interaction with the vehicle.

Closed-Loop  System  Performance 

Learning augmentation can provide an automatic mechanism for improving the
level of closed-loop system performance that is ultimately achieved, through
on-line self-optimization. This, together with the fact that learning systems are
capable of realizing general multivariable functional mappings, means that
initially untapped vehicle superiorities (e.g., agility) might be exploited to provide
enhanced maneuverability.

Operational  Efficiency 

Since on-line learning can be used to accommodate initially unknown spatial
dependencies, some transient effects that would otherwise be associated with
parameter adjustment in an entirely adaptive system can be minimized to improve
the operational efficiency and performance of a hybrid control system. On-line
learning can reduce the burden on the adaptive system of continuously reacting to
predictable nonlinearities.


4  Hybrid Adaptive/Learning Control


A specific indirect hybrid adaptive/learning control methodology is derived
mathematically in this chapter. The motivation for this derivation comes from our
earlier analysis of different learning control system architectures and from a
consideration of the benefits of exploiting a priori design knowledge and of utilizing
on-line adaptation. The discussion will cover three different controllers that are each
based on model-reference compensators of increasing sophistication: a simple
linear compensator, an adaptive version of the linear compensator, and a hybrid
adaptive/learning version of the linear compensator. Thus, the overall learning
augmented control system will be derived by enhancing a simple linear
compensator with both adaptive and learning capabilities.

The model-based linear compensator was designed following a procedure similar
to that in [Anderson & Schmidt (1991)]. We further developed this basic approach
so that it might be applied to nonlinear problems. Subsequently, an adaptive
compensator was developed by incorporating and extending ideas presented in
[Youcef-Toumi & Ito (1990)]. Finally, a hybrid adaptive/learning control system
was developed by combining the same adaptive compensator with a learning
system, and treating the entire problem in a nonlinear framework. The detailed
mathematical derivation is presented in Section 4.1, while several applications of
this control methodology are summarized in Section 4.2.

4.1  Controller Development

In the following derivation, we will assume that the plant dynamics are given in
discrete-time by

$$x_{k+1} = f(x_k, u_k)$$
$$y_k = h(x_k) \qquad (4.1)$$

where x, u, y represent the states, inputs, and outputs, respectively, of the plant,
and dim(x) = n, dim(u) = m, and dim(y) = m. For the purpose of deriving the
baseline linear compensator, we will assume that these equations have been
linearized about an equilibrium point to yield a local approximation of the form

$$x_{k+1} = \Phi x_k + \Gamma u_k$$
$$y_k = C x_k \qquad (4.2)$$

In fact, we will limit the amount of a priori knowledge of the actual plant
dynamics (4.1) that is available for the design of all three controllers (linear, adaptive,
and hybrid) to a single, constant parameter, linear model of the form given in
(4.2). Note that this is not a requirement of our approach; a nonlinear a priori
model of the form given by (4.1) could be used without difficulty. In the derivations
that follow, we chose to use linear systems to describe the reference model and
desired output tracking error dynamics, although once again there is no real
requirement to do so.


The development of the three compensators begins with the specification of two
more dynamical systems: (i) a model of the desired closed-loop dynamics of the
plant (the reference model) and (ii) a model of the desired dynamics of the
difference between the outputs of the reference model and those of the actual plant (the
tracking error dynamics). The output tracking error e is defined as $e = y_r - y$,
where $y_r$ are the outputs of the reference model, y are the outputs of the plant,
and dim(e) = m and dim($y_r$) = m. The desired tracking error dynamics are
defined as

$$e_{k+1} = \Phi_e e_k \qquad (4.3)$$

where $\Phi_e$ is a matrix with all eigenvalues inside the unit circle (i.e., so that (4.3) is
a stable system). Other forms of the error dynamics (4.3) can also be pursued
(e.g., by the addition of integral action). Finally, we will assume that the desired
reference model is also linear and stable:

$$x_{r,k+1} = \Phi_r x_{r,k} + \Gamma_r r_k$$
$$y_{r,k} = C_r x_{r,k} \qquad (4.4)$$

where $x_r$ and r represent the states and inputs, respectively, of the reference
model, and dim($x_r$) = p and dim(r) = m.


4.1.1  Baseline  Linear  Compensator 


The linearized plant model can be expressed in an input-output form by
collapsing the state-space description of (4.2):

$$y_{k+1} = C \Phi x_k + C \Gamma u_k \qquad (4.5)$$

A similar operation can be performed to produce an input-output description of
the reference model:

$$y_{r,k+1} = C_r \Phi_r x_{r,k} + C_r \Gamma_r r_k \qquad (4.6)$$

Now, given (4.3), (4.5), and (4.6), it is possible to summarize the control objective:
assuming (4.2) is minimum phase (the possibility that it is not is discussed
below), we want to pick $u_k$ so that (if possible) the following equation holds

$$\Phi_e e_k = y_{r,k+1} - y_{k+1} \qquad (4.7)$$

where the quantity $y_{r,k+1} - \Phi_e e_k$ becomes the overall desired response of
the aircraft at time k + 1 to the applied controls $u_k$. Thus, at each time k we seek
a $u_k$ that satisfies

$$\Phi_e e_k = y_{r,k+1} - C \Phi x_k - C \Gamma u_k \qquad (4.8)$$

Without much difficulty, a control law can be derived in a manner similar to that
outlined in [Anderson & Schmidt (1991)] to achieve the objective specified by (4.7).
If the matrix $(C\Gamma)$ is invertible, then a potential compensator that would result in
perfect output tracking is given by (4.4) and

$$u_k = (C \Gamma)^{-1} [\, y_{r,k+1} - C \Phi x_k - \Phi_e e_k \,] \qquad (4.9)$$

Note  that  (4.4)  and  (4.9)  together  represent  a  linear,  multivariable,  model-based 
compensator,  requiring  full-state  measurements  (or  the  use  of  a  state  observer). 
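
As a check on the derivation, the following sketch simulates a one-step-ahead model-reference control law of the form u_k = (C Gamma)^{-1} [y_{r,k+1} - C Phi x_k - Phi_e e_k] on a small linear plant. All numerical values (a two-state, single-input, single-output plant with a stable zero, a first-order reference model, and a scalar error-dynamics coefficient of 0.5) are assumptions chosen for illustration, and the plant is taken to match the design model exactly.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def mat_vec(A, v):
    return [dot(row, v) for row in A]

# Assumed linear design model: two states, one input, one output
PHI = [[1.0, 0.1], [0.0, 0.9]]
GAMMA = [0.0, 0.1]
C = [1.0, 0.5]
# Assumed first-order reference model and desired error dynamics
PHI_R, GAMMA_R, C_R = 0.8, 0.2, 1.0
PHI_E = 0.5

def control(x, x_r, r):
    """One-step-ahead control; C*GAMMA is a nonzero scalar in this example."""
    e = C_R * x_r - dot(C, x)                     # e_k = y_{r,k} - y_k
    y_r_next = C_R * (PHI_R * x_r + GAMMA_R * r)  # reference output at k + 1
    c_phi_x = dot(C, mat_vec(PHI, x))
    return (y_r_next - c_phi_x - PHI_E * e) / dot(C, GAMMA)

def simulate(steps, x, x_r, r=1.0):
    """Return the tracking errors e_0 ... e_steps under the control law."""
    errors = []
    for _ in range(steps):
        errors.append(C_R * x_r - dot(C, x))
        u = control(x, x_r, r)
        x = [xi + gi * u for xi, gi in zip(mat_vec(PHI, x), GAMMA)]
        x_r = PHI_R * x_r + GAMMA_R * r
    errors.append(C_R * x_r - dot(C, x))
    return errors

errors = simulate(10, x=[1.0, 0.0], x_r=0.0)
print(errors)
```

With no modeling error, the tracking error in this sketch decays geometrically at the rate set by the chosen error dynamics, independent of the reference command.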


It must be stressed, however, that even if $(C\Gamma)$ is nonsingular, this compensator
cannot  be  used  if  the  zero  dynamics  of  the  open-loop  system  are  unstable,  as  this 
would  be  tantamount  to  attempting  to  cancel  a  nonminimum  phase  zero  [Slotine 
&  Li  (1991)].  In  many  applications,  it  is  known  from  the  basic  physics  of  the  prob¬ 
lem  that  the  plant  does  not  have  any  unstable  transmission  zeros,  and  so  this  is 
not  an  issue.  In  general,  though,  this  problem  can  be  addressed  by  choosing  al¬ 
ternative  tracking  outputs  or  by  utilizing  a  more  sophisticated  on-line  control  de¬ 
sign  technique  (see  below). 


Noninput/Output Square Systems


If $(C\Gamma)$ is singular (e.g., because the system (4.5) is not input-output square), then
the previous compensator (4.9) can be modified by using the following relation

$$u_k = (C \Gamma)^{+} [\, y_{r,k+1} - C \Phi x_k - \Phi_e e_k \,] \qquad (4.10)$$

where $(\cdot)^{+}$ denotes the pseudo-inverse of the argument and $M^{+} \equiv (M^T M)^{-1} M^T$. In
this case, perfect tracking is not generally possible; nevertheless, it can be shown
that (4.10) is the "best" choice for $u_k$ in the sense that it minimizes the Euclidean
norm

$$\| (y_{r,k+1} - \Phi_e e_k) - y_{k+1} \| \qquad (4.11)$$

Similar developments exist for cases where there are more controls than outputs,
or where tracking error is to be balanced against control usage (see below).
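
The least-squares character of the pseudo-inverse solution can be illustrated for the simplest non-square case, two outputs driven by a single control, where the product of the output and input matrices is a column M and the left pseudo-inverse (M^T M)^{-1} M^T reduces to a scalar division. The function names and numerical values below are assumptions for illustration only.

```python
def least_squares_control(M, d):
    """u = (M^T M)^{-1} M^T d for a single control; M has one entry per output."""
    return sum(mi * di for mi, di in zip(M, d)) / sum(mi * mi for mi in M)

def residual_norm(M, d, u):
    """Euclidean norm of the part of d that the control cannot produce."""
    return sum((di - mi * u) ** 2 for mi, di in zip(M, d)) ** 0.5

M = [1.0, 2.0]   # assumed input-to-output gains for two outputs, one control
d = [1.0, 0.0]   # assumed desired one-step output contribution from the control
u_star = least_squares_control(M, d)
best = residual_norm(M, d, u_star)
print(u_star, best)
```

Perfect tracking is not possible here because one control cannot set two outputs independently; the computed control leaves the smallest achievable residual, and any perturbation of it increases that residual.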


Infinite-Horizon LQ Formulation


As an alternative to the direct design approach outlined above, one could instead
set up and solve the following linear-quadratic regulator (LQR) design problem
(where J represents the cost to be minimized) [Anderson & Schmidt (1991)]:

$$J = \frac{1}{2} \sum_{k=0}^{\infty} \left[ (e_{k+1} - \Phi_e e_k)^T Q \, (e_{k+1} - \Phi_e e_k) + u_k^T R \, u_k \right]$$

subject to the constraint equations

$$\begin{bmatrix} x_{k+1} \\ x_{r,k+1} \\ r_{k+1} \end{bmatrix} = \begin{bmatrix} \Phi & 0 & 0 \\ 0 & \Phi_r & \Gamma_r \\ 0 & 0 & \Phi_c \end{bmatrix} \begin{bmatrix} x_k \\ x_{r,k} \\ r_k \end{bmatrix} + \begin{bmatrix} \Gamma \\ 0 \\ 0 \end{bmatrix} u_k$$

where $\Phi_c$ represents a fictitious command shaping filter that does not
appear in the final control law, and Q and R are assumed to be symmetric positive
definite matrices. The cost function is motivated by the desire to simultaneously
achieve the tracking error dynamics (4.3) and at the same time penalize
the applied controls. The solution to this optimization problem yields the necessary
gains corresponding to the state feedback from the plant as well as the
feedforward terms from the reference system and exogenous input. Although
this approach involves greater computation, the solution is guaranteed to be stable
in the linear case under very mild assumptions [Bryson & Ho (1975); Maciejowski
(1989); Stevens & Lewis (1992)]. Since nonminimum phase characteristics were
not generally an issue for us, we chose to use the substantially simpler on-line
design method given by (4.9).

One-Step  LQ  Formulation 

Note also that a one-step optimization involving both output and control weighting
is possible. For example, if the cost to be minimized over the next time-step is

$$J = \frac{1}{2}\left[ (e_{k+1} - \Phi_e e_k)^{T} Q\,(e_{k+1} - \Phi_e e_k) + u_k^{T} R\,u_k \right]$$

then the optimal solution is given by (note the similarity to (4.9) and (4.10))

$$u_k = \left[ R + (C\Gamma)^{T} Q\,(C\Gamma) \right]^{-1}(C\Gamma)^{T} Q \left[\, y_{r,k+1} - C\Phi\,x_k - \Phi_e\,e_k \,\right]$$

Related  developments  are  presented  on  pp.  91-101,  of  Attachment  3. 
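
To make the one-step solution concrete, the following minimal sketch evaluates it for the special case of a single control and a diagonal output weighting, so that no matrix library is needed. The function name, argument names, and numerical values are illustrative assumptions and do not come from the report.

```python
def one_step_lq_control(g, q, r, b):
    """Single-control case of u = [R + (CG)^T Q (CG)]^{-1} (CG)^T Q b.

    g : entries of the column vector (C*Gamma), one per output (assumed values)
    q : diagonal entries of the output weighting Q
    r : scalar control weighting R
    b : entries of y_r[k+1] - C*Phi*x[k] - Phi_e*e[k]
    """
    gain = r + sum(qi * gi * gi for qi, gi in zip(q, g))
    rhs = sum(qi * gi * bi for qi, gi, bi in zip(q, g, b))
    return rhs / gain


# One output, no control penalty: reduces to b / (C*Gamma), i.e. the form of (4.9).
u_square = one_step_lq_control(g=[0.5], q=[1.0], r=0.0, b=[1.0])

# Two outputs, one control, R = 0 and Q = I: the least-squares (pseudo-inverse) answer.
u_least_squares = one_step_lq_control(g=[1.0, 2.0], q=[1.0, 1.0], r=0.0, b=[1.0, 1.0])

# Same problem with a control penalty: the commanded control is smaller.
u_penalized = one_step_lq_control(g=[1.0, 2.0], q=[1.0, 1.0], r=1.0, b=[1.0, 1.0])
```

With $R = 0$ and $Q = I$ the expression coincides with the pseudo-inverse solution for a full-column-rank $(C\Gamma)$; a positive $R$ trades tracking accuracy for reduced control usage.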

4.1.2 Baseline Adaptive Compensator

To address the fact that there may be modeling error (i.e., that the a priori design
model given by (4.2) may be quite different from the actual plant dynamics given
by (4.1)), adaptation can be incorporated into the operation of the linear compensator.
A simple technique for achieving such adaptation is described in [Youcef-Toumi & Ito
(1990)]. The baseline linear compensator described above can be modified to include
this capability as follows. First, we assume the design model for the plant is

$$y_{k+1} = C\Phi\,x_k + C\Gamma\,u_k + \Psi_k \qquad (4.12)$$

where $\Psi_k$ represents the unmodeled behavior at time $k$. Following [Youcef-Toumi &
Ito (1990)], a simple adaptive estimate for the unmodeled nonlinearities can be
generated by solving (4.12) at the previous time index for $\Psi_{k-1}$ and then by
assuming that $\Psi_k \approx \Psi_{k-1}$; this yields

$$\Psi_k = y_k - C\Phi\,x_{k-1} - C\Gamma\,u_{k-1} \qquad (4.13)$$

Given this estimate for the unmodeled behavior, an adaptive compensator can be
constructed in exactly the same way as (4.9) was derived. Simple algebraic
manipulation yields

$$u_k = (C\Gamma)^{-1}\left[\, y_{r,k+1} - C\Phi\,x_k - \Phi_e\,e_k - \Psi_k \,\right] \qquad (4.14)$$


Note that the reference model update equations (4.4) do not change in this derivation,
and hence are not repeated. The adaptive analog of (4.10) that minimizes the
norm (4.11) can be similarly derived.

Clearly, the adaptive mechanism described by (4.13) is very crude and cannot be
used to address changes in control effectiveness $(\partial g/\partial u)$ in the actual plant, as a
function of either $x$ or $u$ (here, the function $g(\cdot,\cdot)$ is defined as the composition of
the functions $h(\cdot)$ and $f(\cdot,\cdot)$ from (4.1); i.e., $g \equiv h \circ f$). Moreover, the time-delay
estimate (4.13) is susceptible to the presence of noise (or similar rapidly varying
influences). Despite these drawbacks, which can be somewhat ameliorated by utilizing
a small time step (faster sampling) and a low-pass filter of the estimate (4.13),
simple adaptive compensators of this form have performed remarkably well in complex
nonlinear simulations (e.g., [Millington & Baker (1992)]) involving process
disturbances, sensor noise, and unmodeled aero, engine, and actuator dynamics.
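
The effect of the time-delay estimate can be illustrated on a scalar toy problem. The sketch below is not taken from the report: the plant coefficients, the constant unmodeled offset `d`, and the error-dynamics pole `phi_e` are all assumed values chosen only to contrast a compensator of the form (4.9) with one of the form (4.14).

```python
def simulate(adaptive, steps=50):
    """Track a constant reference with a scalar plant y[k+1] = a*y + b*u + d.

    The design model knows a and b but not the offset d (assumed values).
    Returns the final tracking error e = y_ref - y.
    """
    a, b = 0.9, 0.5      # design model: C*Phi = a, C*Gamma = b
    d = 0.3              # unmodeled behavior in the "true" plant
    phi_e = 0.5          # desired error dynamics e[k+1] = phi_e * e[k]
    y_ref = 1.0
    y, y_prev, u_prev = 0.0, None, None
    for _ in range(steps):
        e = y_ref - y
        psi = 0.0
        if adaptive and y_prev is not None:
            # time-delay estimate of the unmodeled behavior, as in (4.13)
            psi = y - a * y_prev - b * u_prev
        # control law: (4.9) when psi = 0, (4.14) otherwise
        u = (y_ref - a * y - phi_e * e - psi) / b
        y_prev, u_prev = y, u
        y = a * y + b * u + d    # true plant response
    return y_ref - y


baseline_error = simulate(adaptive=False)
adaptive_error = simulate(adaptive=True)
```

In this setting the baseline compensator settles with a constant offset of $-d/(1-\phi_e)$, whereas the adaptive version recovers the desired error dynamics after the first step because the unmodeled term happens to be constant. With a rapidly varying or noisy unmodeled term the estimate would be less accurate, which is the limitation noted above.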

4.1.3 Hybrid Adaptive/Learning Compensator

Assuming no additional unmodeled states, the a priori modeling error $\Psi_k$ will
generally be a function of both the current states $x_k$ and applied controls $u_k$. In
the simple time-delay estimate (4.13) used by the adaptive compensator, the unmodeled
effect of the current control vector $u_k$ on the subsequent output vector $y_{k+1}$
is not addressed in the determination of $u_k$. This implies that the adaptive
compensator is not well equipped to handle significant modeling error involving
the effectiveness $(\partial g/\partial u)$ of the control variables. In flight applications, control
effectiveness can change dramatically as a function (primarily) of dynamic pressure
and vehicle attitude relative to the incident wind.

To accommodate such nonlinear effects, as well as other errors due to the time-delay
approximation, a learning system can be used to synthesize (on-line) the
functional mapping described by the unmodeled nonlinearities, in terms of the
key plant states, environmental variables, and applied controls. The use of learning
systems in this setting is further motivated and described in [Baker & Farrell
(1991); Baker & Farrell (1992); Millington & Baker (1992)].




The hybrid compensator can be derived from the baseline adaptive compensator by
assuming that the design model for the plant has the form

$$y_{k+1} = C\Phi\,x_k + C\Gamma\,u_k + n(x_k,u_k) + \Psi_k \qquad (4.15)$$

where $n(x_k,u_k)$ represents initially unmodeled nonlinear behavior that will be
"learned" by a network approximation, as a function of $x_k$ and $u_k$. The vector $\Psi_k$
represents any residual nonlinear behavior not captured by the a priori model
(4.2) or through learning augmentation. The network mapping is implicitly dependent
on time for the simple reason that it is evolving as a result of learning action.
As before, a simple adaptive estimate for the unmodeled nonlinearities
can be generated by solving (4.15) at the previous time index for $\Psi_{k-1}$ and
assuming that $\Psi_k \approx \Psi_{k-1}$; this yields

$$\Psi_k = y_k - C\Phi\,x_{k-1} - C\Gamma\,u_{k-1} - n(x_{k-1},u_{k-1}) \qquad (4.16)$$

Unlike the previous cases, the control objective (4.7) cannot be solved directly
because it is nonlinear in $u_k$, due to the presence of the term $n(x_k,u_k)$. One solution
to this nonlinear programming problem is to use an iterative Newton-Raphson
technique [Press, et al. (1988)] to find $u_k$. Linearizing (4.15) about the point $u_{k-1}$
yields


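$$y_{k+1} \approx C\Phi\,x_k + C\Gamma\,u_k + n(x_k,u_{k-1}) + \frac{\partial n}{\partial u}\,(u_k - u_{k-1}) + \Psi_k \qquad (4.17)$$

where $(\partial n/\partial u)$ is the Jacobian matrix of $n(\cdot,\cdot)$ with respect to the argument $u$,
evaluated at $\{x_k, u_{k-1}\}$. Using this approximation (in which $u_k$ appears linearly),
it is possible to compute the first Newton-Raphson estimate

$$u_k = \left[ C\Gamma + \frac{\partial n}{\partial u} \right]^{-1}\left[\, y_{r,k+1} - C\Phi\,x_k - n(x_k,u_{k-1}) + \frac{\partial n}{\partial u}\,u_{k-1} - \Phi_e\,e_k - \Psi_k \,\right] \qquad (4.18)$$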


assuming that the indicated matrix inversion is possible; if the matrix is singular,
then a pseudo-inverse can be used instead. Subsequent estimates are obtained by
relinearizing (4.15) about the most recent estimate of $u_k$. In our work, we have found
that the estimate obtained after the first iteration is sufficiently accurate, given
that $u_{k-1}$ is used as the initial guess, the discrete time step is small, and
the admissible change in the controls $\Delta u_k = u_k - u_{k-1}$ is fundamentally limited by
actuator rate limits. For these reasons, we do not expect $\|\Delta u_k\|$ to be very large and,
thus, believe that (4.18) offers a reasonably good estimate for $u_k$. As with the
adaptive compensator, the reference model update equations do not change in this
derivation, and are not repeated. The hybrid analog of (4.10), which minimizes
the norm (4.11), can also be easily derived.
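
The claim that a single Newton-Raphson step is usually adequate can be examined on a scalar example. In the sketch below the "learned" nonlinearity `n`, its Jacobian `dn_du`, and all numerical values are hypothetical stand-ins; the quantity `y_des` abbreviates $y_{r,k+1} - \Phi_e e_k$.

```python
def n(x, u):
    """Hypothetical learned nonlinearity (stand-in for the network output)."""
    return 0.1 * u * u


def dn_du(x, u):
    """Jacobian of n with respect to u (a scalar in this example)."""
    return 0.2 * u


def newton_step(u_lin, x, y_des, psi, a=0.9, b=0.5):
    """One estimate of the form (4.18), linearizing about the point u_lin."""
    slope = b + dn_du(x, u_lin)    # modified control effectiveness
    rhs = y_des - a * x - n(x, u_lin) + dn_du(x, u_lin) * u_lin - psi
    return rhs / slope


def residual(u, x, y_des, psi, a=0.9, b=0.5):
    """Mismatch between the model output (4.15) and the desired output."""
    return a * x + b * u + n(x, u) + psi - y_des


x, y_des, psi, u_prev = 1.0, 1.2, 0.0, 0.5

# First estimate, using the previous control as the initial guess.
u_first = newton_step(u_prev, x, y_des, psi)

# Further relinearizations about the most recent estimate.
u_iter = u_first
for _ in range(5):
    u_iter = newton_step(u_iter, x, y_des, psi)
```

For these assumed numbers the first step reduces the residual by roughly two orders of magnitude, which is consistent with the argument that a small $\|\Delta u_k\|$ makes the first estimate a good one; a stronger nonlinearity or a larger control change would call for more iterations.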

Additional  Remarks 

It is interesting to compare the controllers that have been obtained from the baseline
linear compensator, through successive augmentation by adaptation and
learning. All three perform the operations associated with updating the reference
model and computing the tracking error and desired tracking output (see
Table 4.1). Moreover, they all have access to the same a priori design model;
namely, that given by (4.5). They differ, however, in the complexity of the process
model considered during the on-line design procedure; in turn, this impacts the
selection of the control vector to be applied. Only the hybrid scheme is capable of
accommodating changes in the control effectiveness, as a function of the state and
applied controls. The Jacobian matrix $(\partial n/\partial u)$ that is added to the constant
matrix $(C\Gamma)$ accounts for such changes.

Note that in the case of the hybrid compensator, a solution that perfectly satisfies
the control objective (4.7) is not guaranteed to exist even if the modified control
effectiveness matrix in (4.18) is invertible. Thus, the nonlinear programming problem
associated with the control selection may not have a solution; whereas in the
cases of the linear or adaptive compensators, the control selection problems are
linear and, hence, are guaranteed to have unique solutions under the assumed
invertibility. In practice, the absence of a solution might occur due to physical
constraints such as actuator position or rate limits that prevent the desired tracking
output from being achieved by any admissible control. In such cases, the desired
output is said to be unreachable given the admissible control values.


Table 4.1. Summary of Three Related Model-Reference Compensators.


The network approximation model $n(x,u)$ that appears in the input-output realization
(4.15) of the plant dynamics represents a mapping $\mathbb{R}^{n} \times \mathbb{R}^{m} \to \mathbb{R}^{m}$, where
$\dim(x) = n$, $\dim(u) = m$, and $\dim(y) = m$. Alternative model mappings can be
synthesized that are compatible with the state-space realization (4.2), although in
such cases the dimension of the mappings (two are required, in general) is
larger: the system state update equation would be augmented with a mapping of
the form $\mathbb{R}^{n} \times \mathbb{R}^{m} \to \mathbb{R}^{n}$, while the output equation would be augmented with a
mapping $\mathbb{R}^{n} \to \mathbb{R}^{m}$. One advantage of the input-output realization is an economical
use of network resources.


In much of our work, we have used a simple incremental gradient learning algorithm
[Baker & Farrell (1992)] to update the adjustable parameters $p$ in the network
model $n(x,u;p)$. To employ this particular algorithm, it must be possible to
compute (or estimate) the gradient $(\partial J/\partial p)$ of a cost function $J$, with respect to
the parameters $p$. The cost function we have chosen to minimize is based on the
norm of the output error of the network mapping (e.g., $J = \tfrac{1}{2}\tilde{n}^{T}\tilde{n}$); thus, an
expression for this error is needed. From (4.15) it is easily seen that the output error
$\tilde{n}$ associated with $n(x_{k-1},u_{k-1};p)$ is given by

$$\tilde{n} = y_k - C\Phi\,x_{k-1} - C\Gamma\,u_{k-1} - n(x_{k-1},u_{k-1};p) \qquad (4.19)$$

Once this error has been determined, the cost $J$ associated with it can be assessed.
Consequently, all the information needed to update the adjustable parameters
(at any time $t \ge k$) by means of the incremental learning algorithm is
contained in the vector-tuple $\{\tilde{n}, x_{k-1}, u_{k-1}\}$. In this setting, training need not be
"synchronized," in the sense that the update of the network parameters need not
occur immediately, and can in fact occur at a much later point in time, provided
that the appropriate vector-tuple of information can be recalled.
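
As an illustration of the incremental gradient update driven by the error (4.19), the sketch below trains a small linear-in-parameters approximator on a scalar toy plant. The basis functions, the unmodeled term `0.3*x*u`, the learning rate, and the training grid are all assumptions made for this example; the network actually used in the work is described elsewhere in the report.

```python
def features(x, u):
    """Basis functions of a hypothetical linear-in-parameters network."""
    return [x, u, x * u]


def network(x, u, p):
    """Network output n(x, u; p) for parameter vector p."""
    return sum(pi * fi for pi, fi in zip(p, features(x, u)))


def plant_output(x, u):
    """'True' scalar plant: design model plus an unmodeled term 0.3*x*u."""
    return 0.9 * x + 0.5 * u + 0.3 * x * u


def train(epochs=200, rate=0.2):
    a, b = 0.9, 0.5                      # design model values (C*Phi, C*Gamma)
    p = [0.0, 0.0, 0.0]
    grid = [-1.0, -0.5, 0.0, 0.5, 1.0]   # deterministic sweep of (x, u) points
    for _ in range(epochs):
        for x_prev in grid:
            for u_prev in grid:
                y = plant_output(x_prev, u_prev)
                # network output error, in the form of (4.19)
                err = y - a * x_prev - b * u_prev - network(x_prev, u_prev, p)
                # gradient of J = 0.5*err**2 with respect to p is -err * features
                grad = [-err * f for f in features(x_prev, u_prev)]
                p = [pi - rate * gi for pi, gi in zip(p, grad)]
    return p


p_learned = train()
```

Each update uses only the error together with the stored pair $(x_{k-1}, u_{k-1})$, which is why the tuples could equally be buffered and replayed later. For this linear-in-parameters choice the gradient direction does not depend on the current parameters; a general network would need the Jacobian evaluated at update time.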

A  more  elaborate  discussion  of  the  learning  system  used  in  our  work  is  presented 
in  Section  5.6. 

4.2 Applications of Hybrid Control to Nonlinear Systems

The indirect hybrid adaptive/learning control methodology outlined in the previous
section has been successfully applied to a number of nonlinear dynamical
systems. The results associated with three different applications of hybrid control
are summarized below. Specific details of each application example and experimental
setup are fully documented in the attachments and will not be repeated
here.

Each  example  conforms  to  the  following  basic  scenario: 

•  the dynamical system to be controlled is nonlinear

•  the only a priori design information available about the plant is a single,
constant parameter linear model; thus, there is always model uncertainty

•  the desired reference model is represented by a stable linear system

•  the desired tracking error dynamics are represented by a stable linear
system.

4.2.1 Cart-Pole System: Split-Level Track Problem

This work is fully documented in §5, pp. 65-100, of Attachment 1 [Baird (1991)]; for
related work, see also [Baird & Baker (1990); Baker & Farrell (1990)].¹

¹ The symbol § will be used exclusively to refer to sections that appear in the
attachments; the word "section" will be used when referring to material that appears
in the main document.

The "split-level track problem" is based on the infamous cart-pole system (an inverted
pendulum on a translating cart; see figure, p. 65, Attachment 1). The
cart-pole problem is a staple of control theory textbooks and learning control papers
(e.g., [Kailath (1980); Barto, Sutton, & Anderson (1983); Friedland (1986); Anderson
(1989); Morgan, Patterson, & Klopf (1990); Baird & Baker (1990); Baker &
Farrell (1990)]). The problem is to move the cart to some desired track position by
applying force directly to the cart center of mass, while at the same time balancing
a rigid pole that is attached to the cart via a hinge. A key feature of the split-level
track problem is that the track is not flat, and instead contains an incline
that is not included in the design model.

The hybrid adaptive/learning control methodology, as outlined in Section 4.1, was
used as a position controller for the cart-pole system on the split-level track. In a
manner consistent with (4.1)-(4.4), the only a priori knowledge of the plant that
was available was a single linear model. The desired reference model was linear,
as were the desired tracking error dynamics.

The nonlinear equations-of-motion for the cart-pole system, open-loop dynamics,
model parameter values, definition of the split-level track problem, linearized a
priori design model, and linear reference model are all discussed in §5.1 of Attachment 1.
Experimental results with and without sensor noise are provided in
§5.2 through §5.6.

Overall, the hybrid controller outperformed both a linear compensator of the form
given by (4.9) and an adaptive compensator of the form given by (4.14).² In related
work, unmodeled actuator dynamics, sensor noise, and a pure time-delay were
also incorporated into the split-level track problem; again, the hybrid controller
outperformed both the linear and adaptive compensators [Baker & Farrell (1990)].

² The hybrid control law (4.18) was successful in this application even though the cart-pole
system has unstable zero dynamics. The explanation for this apparent inconsistency is that
the reference model used in this work was derived by linearizing the cart-pole system
dynamics (about its unstable equilibrium position) and then combining this open-loop model
with a stabilizing linear feedback control law. Since the zeros of a transfer function are
unaffected by state feedback, the desired reference model included the original nonminimum
phase zeros of the plant. As a result, no attempt was made to automatically "cancel" the
dynamics associated with these nonminimum phase zeros, and hence the control law was not
susceptible to this form of instability.

4.2.2  Aeroelastic  Oscillator

This work is fully documented in §4.1, pp. 48-65, of Attachment 2 [Nistler (1992)];
for related work, see also [Atkins (1993); Cerrato (1993)].

The aeroelastic oscillator [Parkinson & Smith (1963)] is a complex nonlinear system
that exhibits limit cycle behavior. This system can be represented as a simple,
two-state, mass-spring-dashpot model of an aerodynamically driven oscillator,
and can also be considered as a simple model of wing flutter or other similar
aeroelastic behavior.

The nonlinear equations-of-motion for the aeroelastic oscillator, its open-loop dynamics,
linearized a priori design model, linear reference model, and the application
of the hybrid controller are all discussed on pp. 48-54 of Attachment 2. Two
simulation experiments using the hybrid controller were performed. In the first
example, the objective was simply to regulate the oscillator to zero position and
zero velocity (i.e., to null out limit cycles induced by the nonlinear aerodynamics
via an applied force). In the second example, the goal was to command and hold
an arbitrary position (deflection), while maintaining zero velocity.

In both experiments, the hybrid adaptive/learning control system outperformed a
similar control system having adaptive augmentation only. These results, together
with a three-dimensional plot of the nonlinear dynamics that were synthesized
by the learning system, are presented on pp. 55-65 of Attachment 2.

4.2.3 Three-Degree-of-Freedom Flight Control

This work is fully documented in §4.2, pp. 66-96, of Attachment 2 [Nistler (1992)];
for related work, see also [Baker & Millington (1992); Millington & Baker (1992);
Baker & Millington (1993); Millington, Baker, & Koenig (1993)].

As discussed in Chapter 3, there are a number of difficulties associated with the
design of flight control systems. A three-degree-of-freedom (3-DOF) nonlinear
model of the longitudinal dynamics of a representative high performance aircraft
was obtained from a full 6-DOF nonlinear aircraft model [Brumbaugh (1991)].³
This nonlinear 3-DOF model was then used as the basis of a more challenging
problem for the hybrid control methodology, relative to the two applications
considered above.

³ This 6-DOF model is identical to that used in the AIAA Controls Design Challenge.

A description of the nonlinear aircraft model, linearized a priori design model,
linear reference model, and a discussion of relevant application issues are all
presented on pp. 66-77 of Attachment 2. Once again, two simulation experiments
with the hybrid controller were performed. In the first example, the objective of
the hybrid controller was to serve as a simple autopilot (i.e., to maintain commanded
altitude and airspeed via the use of a horizontal stabilator and throttle).
In the second example, the operational envelope for the autopilot was expanded, to
make the problem even more challenging.

Once again, the hybrid adaptive/learning control system outperformed a similar
control system having adaptive augmentation only. These results are presented
on pp. 78-96 of Attachment 2.

4.3 Learning Augmented Estimation

The estimation or reconstruction of system state variables from observed output
measurements typically requires an accurate model of the system. On the other
hand, the hybrid control and learning schemes outlined in Section 4.1 assumed
the absence of an accurate model, but the availability of accurate state estimates.
Thus, a more general problem exists, involving simultaneous system identification,
state estimation, and control. Obviously, learning could be used to address
some aspects of this more general problem if the estimation and control processes
were allowed to utilize the modeling information provided by a learning system.
This is the basic idea underlying learning augmented estimation. The material
presented in this section only represents a first step in this direction; there is
much room for further analysis and development.

Beginning with a standard linear state estimator, the discussion below proceeds
by successively incorporating adaptation and learning in a manner completely
parallel to the development of the hybrid adaptive/learning control methodology.

Linear  Estimation 


The standard, steady-state, Kalman filter propagation and update equations for a
linear dynamical system of the form (4.2) are given by ([Kalman (1960)]; see also
[Friedland (1986); Gelb (1974); Jazwinski (1970); Maybeck (1979); Sorenson (1985)]):

$$\begin{aligned}
\hat{x}_k^{-} &= \Phi\,\hat{x}_{k-1} + \Gamma\,u_{k-1} \\
\hat{x}_k &= \hat{x}_k^{-} + K\left( y_k^{m} - C\,\hat{x}_k^{-} \right)
\end{aligned} \qquad (4.20)$$

where $\hat{x}_k^{-}$ indicates the estimate of the state after propagation, but prior to
incorporation of the measurement $y_k^{m}$, while $\hat{x}_k$ represents the final state estimate
after the measurement update. The matrices $\Phi$, $\Gamma$, and $C$ are given by the assumed
model of the system, and $K$ is the steady-state Kalman gain matrix (which
is the optimal gain matrix under certain mild assumptions).


If the actual system dynamics differ from those described by (4.2) (e.g., if the system
is nonlinear), then the estimates generated by (4.20) will be suboptimal, and
in fact may be quite inaccurate.


Adaptive Augmentation


In an attempt to accommodate such modeling error in the estimation process,
one can add an adaptive component to the estimator. This component seeks to improve
the accuracy of the propagation equation by addressing the modeling error
of the process, using the difference between the final state estimate at the previous
step and the previous linear propagation as a kind of time-delay estimate (in a
fashion completely analogous to what was done previously in Section 4.1). In this
case the design model for the process becomes:

$$\begin{aligned}
x_k &= \Phi\,x_{k-1} + \Gamma\,u_{k-1} + \xi_{k-1} \\
y_k &= C\,x_k
\end{aligned} \qquad (4.21)$$

where the vector $\xi$ represents the modeling error in the state equation (for the
present, we assume that there is no modeling error in the output equation). A
time-delay estimate for $\xi_{k-1}$ can be obtained from previous state estimates as follows:

$$\hat{\xi}_{k-1} = \hat{x}_{k-1} - \Phi\,\hat{x}_{k-2} - \Gamma\,u_{k-2}$$

Based on this estimate of the unmodeled dynamics, a new propagation equation
can be written that incorporates such adaptation:

$$\hat{x}_k^{-} = \Phi\,\hat{x}_{k-1} + \Gamma\,u_{k-1} + \hat{\xi}_{k-1}$$

A complete set of adaptive filter equations can be defined using this propagation
equation and the update equation from (4.20):

$$\begin{aligned}
\hat{\xi}_{k-1} &= \hat{x}_{k-1} - \Phi\,\hat{x}_{k-2} - \Gamma\,u_{k-2} \\
\hat{x}_k^{-} &= \Phi\,\hat{x}_{k-1} + \Gamma\,u_{k-1} + \hat{\xi}_{k-1} \\
\hat{x}_k &= \hat{x}_k^{-} + K\left( y_k^{m} - C\,\hat{x}_k^{-} \right)
\end{aligned} \qquad (4.22)$$

These equations can also be rewritten in a form that is similar to that of (4.20);
when this is done, one can see that this scheme introduces an intermediate
adaptive step, resulting in the following three-step process:

$$\begin{aligned}
\bar{x}_k &= \Phi\,\hat{x}_{k-1} + \Gamma\,u_{k-1} \\
\hat{x}_k^{-} &= \bar{x}_k + \left( \hat{x}_{k-1} - \bar{x}_{k-1} \right) \\
\hat{x}_k &= \hat{x}_k^{-} + K\left( y_k^{m} - C\,\hat{x}_k^{-} \right)
\end{aligned}$$

where in this case $\bar{x}_k$ represents the state perturbation estimate after propagation
but prior to the adaptive correction and measurement update, $\hat{x}_k^{-}$ represents the
state estimate after the adaptive correction is made, and $\hat{x}_k$ is the final state
estimate. Some empirical results [Millington & Baker (1992)] have verified the ability
of this adaptive augmentation to address estimation difficulties introduced by
unmodeled nonlinearities.
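
A scalar toy problem shows how the intermediate adaptive step changes the behavior of the estimator. The sketch below is an illustration only: the system coefficients, the constant unmodeled term `d`, the fixed gain `K`, and the noise-free measurement are all assumed, and the input signal is arbitrary.

```python
import math


def run_filter(adaptive, steps=200):
    """Estimate the state of x[k] = a*x[k-1] + g*u[k-1] + d with a fixed-gain filter.

    The filter model knows a, g and c but not the offset d (assumed values).
    Returns the final estimation error x - x_hat.
    """
    a, g, c = 0.9, 0.5, 1.0
    d = 0.3                  # unmodeled term in the true state equation
    K = 0.5                  # fixed gain, standing in for a steady-state Kalman gain
    x, x_hat = 1.0, 0.0
    xi_hat = 0.0             # time-delay estimate of the modeling error
    for k in range(steps):
        u = math.sin(0.1 * k)
        x = a * x + g * u + d                  # true system
        y_meas = c * x                         # noise-free measurement
        x_bar = a * x_hat + g * u              # linear propagation
        x_minus = x_bar + (xi_hat if adaptive else 0.0)   # adaptive correction
        x_hat = x_minus + K * (y_meas - c * x_minus)      # measurement update
        xi_hat = x_hat - x_bar                 # estimate used at the next step
    return x - x_hat


linear_error = run_filter(adaptive=False)
adaptive_error = run_filter(adaptive=True)
```

For this constant modeling error the unaugmented filter retains a steady bias of $(1-K)d / (1 - (1-K)a)$, while the adaptive version drives the error toward zero. Measurement noise, which is omitted here, would enter the time-delay estimate directly.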


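Note that this approach does not account for the possibility of modeling error in
the output equation. If a new design model for the process were used

$$\begin{aligned}
x_k &= \Phi\,x_{k-1} + \Gamma\,u_{k-1} + \xi_{k-1} \\
y_k &= C\,x_k + \omega_k
\end{aligned} \qquad (4.23)$$

then one could obtain time-delay estimates for the modeling errors $\xi_{k-1}$ and $\omega_k$,
and ultimately arrive at the following adaptive filter equations:

$$\begin{aligned}
\hat{\xi}_{k-1} &= \hat{x}_{k-1} - \Phi\,\hat{x}_{k-2} - \Gamma\,u_{k-2} \\
\hat{\omega}_k &= y_{k-1}^{m} - C\,\hat{x}_{k-1} \\
\hat{x}_k^{-} &= \Phi\,\hat{x}_{k-1} + \Gamma\,u_{k-1} + \hat{\xi}_{k-1} \\
\hat{x}_k &= \hat{x}_k^{-} + K\left( y_k^{m} - C\,\hat{x}_k^{-} - \hat{\omega}_k \right)
\end{aligned} \qquad (4.24)$$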

Learning  Augmentation 


Like adaptation, learning can also be incorporated into the estimation process.
Following the same development procedure as before, we begin with a design
model for the process:

$$\begin{aligned}
x_k &= \Phi\,x_{k-1} + \Gamma\,u_{k-1} + n^{x}(x_{k-1},u_{k-1}) + \xi_{k-1} \\
y_k &= C\,x_k + n^{y}(x_k) + \omega_k
\end{aligned} \qquad (4.25)$$

where $n^{x}$ and $n^{y}$ represent two distinct mappings that will be synthesized by the
learning system, and $\xi$ and $\omega$ represent any residual unmodeled dynamics. As
before, both $n^{x}$ and $n^{y}$ are implicitly functions of time since they will be evolving
due to learning.


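Given the design model (4.25), the complete set of hybrid adaptive/learning augmented
filter equations becomes

$$\begin{aligned}
\hat{\xi}_{k-1} &= \hat{x}_{k-1} - \Phi\,\hat{x}_{k-2} - \Gamma\,u_{k-2} - n^{x}(\hat{x}_{k-2},u_{k-2}) \\
\hat{\omega}_k &= y_{k-1}^{m} - C\,\hat{x}_{k-1} - n^{y}(\hat{x}_{k-1}) \\
\hat{x}_k^{-} &= \Phi\,\hat{x}_{k-1} + \Gamma\,u_{k-1} + n^{x}(\hat{x}_{k-1},u_{k-1}) + \hat{\xi}_{k-1} \\
\hat{x}_k &= \hat{x}_k^{-} + K\left( y_k^{m} - C\,\hat{x}_k^{-} - n^{y}(\hat{x}_k^{-}) - \hat{\omega}_k \right)
\end{aligned} \qquad (4.26)$$

An incremental gradient learning algorithm can be used to update each of the
required mappings. In this case, the network output errors $\tilde{n}^{x}$ and $\tilde{n}^{y}$ associated
with $n^{x}(\hat{x}_{k-1},u_{k-1})$ and $n^{y}(\hat{x}_k)$, respectively, are given by:

$$\begin{aligned}
\tilde{n}^{x} &= \hat{x}_k - \Phi\,\hat{x}_{k-1} - \Gamma\,u_{k-1} - n^{x}(\hat{x}_{k-1},u_{k-1}) \\
\tilde{n}^{y} &= y_k^{m} - C\,\hat{x}_k - n^{y}(\hat{x}_k)
\end{aligned}$$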

Additional  Remarks 

In  general  and  for  best  results,  the  gain  matrix  K  should  be  chosen  to  reflect  the 
local  (linearized)  system  dynamics  and,  hence,  should  be  adjusted  on-line.  One 
means for determining a suitable gain matrix is to compute the optimal steady-state
Kalman gain matrix associated with the local (linearized) system.
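
One way such a gain could be obtained for a scalar local model is to iterate the discrete Riccati recursion until it settles. The sketch below is a generic illustration rather than the procedure used in this work; the process and measurement noise variances `q` and `r` are assumed tuning values that do not appear in the report.

```python
def steady_state_kalman_gain(a, c, q, r, iterations=500):
    """Iterate the scalar Riccati recursion for x[k] = a*x[k-1] + w, y = c*x + v.

    q and r are the (assumed) variances of w and v.
    Returns the converged gain and the post-update error variance.
    """
    p = 1.0      # initial error variance (arbitrary positive starting value)
    k = 0.0
    for _ in range(iterations):
        p_minus = a * a * p + q                  # variance after propagation
        k = p_minus * c / (c * c * p_minus + r)  # gain for this step
        p = (1.0 - k * c) * p_minus              # variance after the update
    return k, p


# Local (linearized) model values chosen only for illustration.
gain, variance = steady_state_kalman_gain(a=0.9, c=1.0, q=0.01, r=0.1)
```

In an on-line setting the values of `a` and `c` would be refreshed from the current linearization, and the recursion could be warm-started from the previous variance instead of restarting from an arbitrary value.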


Note also that this approach requires that two mappings (one for the state equation
and one for the output equation) be synthesized via learning. Because our
other work with the hybrid adaptive/learning control methodology was based on
the collapsed input/output system description given by (4.5) (which involved a single
mapping), we did not have the opportunity to evaluate the performance of
(4.26), although we fully expect it to perform better than either (4.20) or (4.21) when
unmodeled nonlinearities are present.

5 Multiaxis Flight Control


A main objective of this program was to demonstrate the feasibility and advantages
of learning augmented flight control in the context of six-degree-of-freedom
(6-DOF), multiaxis flight. One specific feature to be demonstrated was the ability
of a learning augmented flight control system to provide high performance control
despite significant modeling uncertainty and nonlinearities in the aircraft
dynamics. Accordingly, a hybrid adaptive/learning control system was developed
and applied to a challenging multiaxis flight control problem. The performance
of the resulting learning augmented flight controller was contrasted with three
similar control systems: an unaugmented compensator (a priori linear design
only), an adaptively augmented compensator, and a compensator having essentially
"ideal learning" augmentation.

Of the many specific aircraft control problems that could be selected for this study,
it was desired to choose one that would minimize unnecessary engineering complexity
while still demonstrating the advantages of learning augmented flight
control. To this end, a longitudinal and lateral/directional control augmentation
system (CAS) with turn coordination was used as the target control application.
The rationale for this selection and further details of the problem are discussed in
Section 5.1. The hybrid controller was evaluated via a closed-loop simulation of a
6-DOF, nonlinear aircraft model [Brumbaugh (1991)]. A brief overview of this
model is given in Section 5.2. The salient characteristics of the open-loop dynamics
of the vehicle are presented in Section 5.3. The "invertibility" of the assumed
plant dynamics will be of particular interest, given the nonlinear control techniques
used in this study.

There are numerous ways in which learning might be applied to the selected
flight control problem. Three such methods are presented in Section 5.4. A qualitative
evaluation of these candidate methods led to the selection of a variant of
the hybrid learning/adaptive approach which was discussed in Chapter 4. The
application of this scheme to the 6-DOF CAS design for the nonlinear aircraft
model is presented in Section 5.5, and includes a description of the learning system
which was used. Finally, Section 5.6 presents the experimental results of the hybrid
control system in the context of a challenging S-trajectory maneuver.

5.1 Control Problem Definition


There is a wide range of 6-DOF flight control problems that can be studied.
These run the gamut from altering the natural modes of the aircraft (i.e., stability
augmentation systems) to the higher level functions of control augmentation systems
and autopilot design. Under some design procedures the higher level functions
encompass lower level designs. The desire to minimize engineering complexity
and avoid man-in-the-loop issues, while still achieving the task objective,
has led to the selection of a control augmentation system as the design problem.
Specifically, the demonstration problem is the design of a longitudinal and
lateral/directional CAS with turn coordination.

The demonstration platform is a nonlinear model of a modified F-15, as discussed
in [Brumbaugh (1991)]. The dynamics of this vehicle are inherently nonlinear in
angle-of-attack $\alpha$, sideslip $\beta$, airspeed $V$, and altitude $H$. This
aircraft has four control inputs available: rudder, aileron, symmetric
stabilator, and differential stabilator. The throttle setting is assumed to be
independently controlled (we held it at a constant value). Thus, the control
problem is to use the four control inputs to track reference values of the
stability-axis pitch rate and roll rate, while regulating $\beta$ to zero, as
shown in Fig. 5.1.


$x = [\,p \;\; q \;\; r \;\; V \;\; \alpha \;\; \beta \;\; \theta \;\; \phi \;\; h\,]^T$


Figure  5.1.  The  Basic  Control  Problem. 

For the purposes of this study, it is not necessary to design a full-envelope
controller. The intent is to design and demonstrate a controller that will
operate over a sufficiently large region of the envelope so as to exercise the
nonlinear dynamics of the system. To that end, the demonstration will focus on a
single demanding "S-trajectory" maneuver involving rapid airspeed variation, high
angles-of-attack, and significant coupling between the different degrees of
freedom.

5.2 Flight Simulation

The nonlinear aircraft model, including its actuators, sensors, and atmospheric
environment, is briefly outlined in this section. For more information regarding
this particular vehicle model, please refer to pp. 66-74 of Attachment 2, or to
[Brumbaugh (1991)].^

Aircraft  Model 

The simulation code used in this work is based on a six-degree-of-freedom,
rigid-body, high performance aircraft model, including nonlinear aerodynamic
effects (based on empirically derived tabular data), nonlinear engine dynamics,
and nonlinear actuator dynamics (including rate and position limits). The control
features of this aircraft include: (i) a rudder surface mounted on a single
vertical tail, (ii) an all-moving horizontal tail (stabilator) capable of
symmetric and differential movement, and (iii) wing ailerons. A more detailed
description of the basic aircraft model can be found in [Brumbaugh (1991)].

Actuator  Models 

All control surfaces employ identical actuator dynamics, with 0.033 s time
constants and 35°/s rate limiting. Additionally, the stabilator is constrained by
asymmetric position limits of +15° / -25°, while the aileron and rudder saturate
symmetrically at ±20° and ±30°, respectively.
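
The actuator description above can be sketched as a first-order lag with rate and position limiting. This is only an illustrative sketch: the function names, the Euler integration, the step size, and the order in which the limits are applied are assumptions, not details of the simulation code used in this work, and the rate-limit default is a placeholder that should be set to the value used in the actual model.

```python
def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def actuator_step(pos, cmd, dt, tau, rate_limit, pos_min, pos_max):
    """Advance a first-order actuator by one Euler step (angles in degrees)."""
    cmd = clamp(cmd, pos_min, pos_max)            # position-limit the command
    rate = (cmd - pos) / tau                      # first-order lag response
    rate = clamp(rate, -rate_limit, rate_limit)   # rate limiting (deg/s)
    return clamp(pos + rate * dt, pos_min, pos_max)


def simulate(cmd, steps, dt=0.001, tau=0.033, rate_limit=35.0,
             pos_min=-25.0, pos_max=15.0):
    """Step response from zero deflection; returns the position history.

    Defaults mimic the stabilator limits quoted in the text; the rate-limit
    value is an assumed placeholder.
    """
    pos, history = 0.0, []
    for _ in range(steps):
        pos = actuator_step(pos, cmd, dt, tau, rate_limit, pos_min, pos_max)
        history.append(pos)
    return history


# A 30 deg stabilator command saturates at the +15 deg position limit,
# and the early part of the response is rate limited.
stabilator = simulate(cmd=30.0, steps=2000)
print(round(stabilator[-1], 3))
```

In this sketch the rate limit dominates the response to large commands, while the 0.033 s lag only shapes the final approach to the commanded position.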

Sensor  Models 

We assume the controller has access to a full-state sensor suite, including: wind
relative angles (angle-of-attack and sideslip angles), altimeter, airspeed and
Mach indicators, as well as roll, pitch, and heading attitudes. In this work, the
sensors were assumed to be ideal (i.e., to contribute no noise or delay).

^ This 6-DOF model is identical to that used in the AIAA Controls Design Challenge
Gust Model


To demonstrate the robustness properties of our system while it was learning, we
incorporated  moderate  atmospheric  turbulence  into  our  simulation.  Specifically, 
we  used  a  Dryden  gust  model  [MIL-STD-1797A  (1990)]. 

5.3 Open-Loop Dynamics

Before proceeding with any type of control system design (whether conventional,
adaptive, learning, or otherwise), it is important to examine any a priori
information about the plant dynamics which may be available. At the very least,
this information guides the design of the overall control system architecture
and, in our
approach,  provides  the  required  information  to  develop  the  fixed,  conventional 
component  of  the  hybrid  controller. 

The aircraft model consists of a set of first-order, nonlinear differential
equations [Brumbaugh (1991)]. Many of the "coefficients" of these nonlinear
equations are nonlinear functions of angle-of-attack, airspeed, sideslip angle,
altitude, and the applied controls. To characterize the dynamics of this
aircraft, the nonlinear dynamics were numerically linearized at several
equilibrium points throughout the operating envelope, ranging in speed from 600
to 1,000 ft/s and in altitude from 5,000 to 40,000 ft. The aircraft eigenvalues
and eigenvectors were found to be quite typical of an aircraft of this type. The
aircraft is open-loop stable with the exception of an unstable phugoid mode at a
low altitude, high speed operating point (5,000 ft, 987 ft/s). The operating
regime near H = 5,000 ft and V = 600 ft/s is of particular interest since the
evaluation maneuvers are initiated from this point. The modal frequencies are
shown in Table 5.1 below.

Table 5.1. Modal Frequencies for H = 5,000 ft and V = 600 ft/s.

Frequency (rad/s)     Mode
-----------------     ----------------
-1.79 ± 2.51i         short period
-0.48 ± 3.01i         dutch roll
-2.78                 roll convergence
-1.60                 engine core
-0.015 ± 0.076i       phugoid
-0.026                spiral
-0.002                altitude
0.0                   heading
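
The complex entries in Table 5.1 are eigenvalues, so the undamped natural frequency and damping ratio of each oscillatory mode follow from the standard relations $\omega_n = |\lambda|$ and $\zeta = -\mathrm{Re}(\lambda)/|\lambda|$. The short sketch below applies these relations to the tabulated values; the function and variable names are illustrative only.

```python
def modal_properties(eigenvalue):
    """Return (natural frequency in rad/s, damping ratio) for a complex eigenvalue."""
    omega_n = abs(eigenvalue)
    zeta = -eigenvalue.real / omega_n
    return omega_n, zeta


# Oscillatory modes from Table 5.1 (one eigenvalue of each conjugate pair).
oscillatory_modes = {
    "short period": complex(-1.79, 2.51),
    "dutch roll": complex(-0.48, 3.01),
    "phugoid": complex(-0.015, 0.076),
}

properties = {name: modal_properties(eig) for name, eig in oscillatory_modes.items()}
for name, (omega_n, zeta) in properties.items():
    print(f"{name}: wn = {omega_n:.3f} rad/s, zeta = {zeta:.3f}")
```

At this operating point the short period mode is the best damped of the three, while the dutch roll and phugoid modes are lightly damped.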

Since many of the candidate control methodologies perform some form of nonlinear
inversion of the plant dynamics, the zero dynamics (minimum/nonminimum phase)
characteristics of the plant are of particular interest. If the tracking outputs
are chosen to be $y = [p, q, \beta]^T$, then the system has a multivariable
nonminimum phase zero at 0.003 rad/s. There is also a nonminimum phase zero in
the single-input/single-output (SISO) transfer function $\beta(s)/\delta_r(s)$ at
0.043 rad/s. The consequence of these nonminimum phase properties is that any
attempt to control sideslip $\beta$ directly via the rudder input $\delta_r$,
using linear or nonlinear inversion techniques, will likely result in an unstable
closed-loop system.

5.4  Control  Methodology 

Conventionally,  the  control  system  "architecture"  refers  to  the  specification  of  the 
measurements  and  controls,  and  the  feedback  paths  between  these  variables.  The 
control  objectives,  the  open-loop  dynamics  of  the  aircraft,  and  the  availability  and 
complexity  of  the  design  methodologies  are  the  three  most  important  factors  in  de¬ 
termining  the  appropriate  control  architecture  in  conventional  control  schemes. 
For learning augmented controllers, the augmentation mechanism is an additional
design degree-of-freedom that must be addressed.


5.4.1  Candidate  Methodologies 


In  this  subsection,  three  different  learning  augmentation  schemes  are  presented 
and  evaluated  in  the  context  of  the  CAS  problem. 

Feedforward  Augmentation 

The feedforward learning augmentation concept (see Fig. 5.2) is based on the
so-called "two-parameter" compensator [Vidyasagar (1985)]. As the name implies,
the two-parameter compensator consists of two components: a feedback compensator
which provides robust stability and disturbance rejection, and a feedforward
compensator (or prefilter) which is tuned for tracking performance. In the
feedforward learning augmentation scheme, the feedback compensator is determined
by standard robust linear design techniques. However, the feedforward
compensator is designed on-line, by using the learned local linearized dynamics
of the inner-loop (plant plus feedback compensator) throughout the flight
envelope. During the early stages of learning (prior to parameter convergence),
the linear model of the inner-loop might be identified via a conventional
recursive identification scheme.


Figure 5.2. Feedforward Learning Augmentation. [Block diagram; legible block
label: Adaptive & Learning Systems.]

The on-line design procedure involves the stable inversion of the learned,
linearized model. Given knowledge of the local linearized inner-loop dynamics
$P(s)$, the prefilter $K_{ff}(s)$ that minimizes the $H_\infty$ norm [Maciejowski
(1989)] of the tracking error is the stable inverse of $P$,

$$K_{ff} = P_o^{-1}\left(P_i^{-1}\right)_{+}$$

where $P = P_i P_o$ is the inner/outer factorization [Maciejowski (1989)] of $P$,
and $\left(P_i^{-1}\right)_{+}$ denotes the stable part of $P_i^{-1}$. This
technique is tolerant of nonminimum phase dynamics since inversion of only the
minimum phase portion of the plant ensures that $K_{ff}$ is a stable transfer
function.

The  advantage  of  this  architecture  is  that  the  features  of  learning  augmentation 
will  enhance  the  performance  of  the  system  in  the  presence  of  modeling  uncer¬ 
tainties,  unexpected  changes  in  the  plant  dynamics,  and  nonlinearities,  and  yet 
the  stability  of  the  inner-loop  will  be  insensitive  to  learning  failures  if  the  learning 
dynamics  are  slow  relative  to  the  inner-loop  dynamics  (which  is  normally  the 
case).  This  "learning-fail-safe"  architecture  might  be  appropriate  for  flight  test 
applications,  where  reliability  is  critical. 

One disadvantage of this scheme is the requirement to identify the local linear
representation of the augmented plant. The identification of such a highly
structured model poses problems for both learning based and conventional system
identification algorithms. While the current learning algorithm (see Section 5.6)
provides a good input/output map of the desired nonlinear function, extracting
local linear models from the map can be difficult and can exacerbate small
errors, because one must differentiate this map to find the local Jacobian
matrices with respect to the inputs. Additionally, the development of a system
identification algorithm (e.g., via an extended Kalman filter or recursive
maximum likelihood technique) for 6-DOF aircraft dynamics is an enormous problem
in its own right. Substantial supervisory logic could be required to guarantee
parameter convergence in the presence of disturbances and sensor noise.

The computational burden associated with this on-line inversion algorithm is
another concern. Let n be the number of states of the system. The stable
inversion algorithm involves the solution of the algebraic matrix Riccati
equation [Kailath (1980)], which requires on the order of $73n^3$
multiplications/divisions, $82n^3$ additions/subtractions, and square root
operations when using the Schur method [Ramesh, Senol, & Garba (1989)].
Plant Augmentation


This scheme is an extension of the model reference adaptive control (MRAC)
concept [Astrom & Wittenmark (1989); Narendra & Annaswamy (1989)]. In this case,
the controller would consist of a linear reference model $P_m(s)$, the
adaptive/learning system, and one of two linear compensators. The purpose of the
adaptive and learning systems is to augment the dynamics of the nonlinear plant
so that it has the same input/output behavior as the reference model. The usual
division of responsibilities between the adaptive and learning components
applies here, with the adaptive component accommodating time-varying dynamics
and the learning component addressing state-dependent nonlinearities. The
reference model $P_m(s)$ is a linearized model of the nominal plant dynamics (see
Fig. 5.3).




Figure  5.3.  Plant  Augmentation. 

At any instant, linear control is provided by either a "robust" compensator or a
"high performance" compensator. The robust controller would be used during the
early stages of learning when the augmented plant (shaded area of Fig. 5.3) may
deviate significantly from $P_m(s)$. During this period, the large plant residual
will cause the switching logic to close the loop through the robust controller.
When learning has converged, the switching logic closes the outer-loop through
the high performance controller. Both linear controllers would be synthesized
using the μ-synthesis robust design methodology; however, the modeling
uncertainty Δ(s) assumed in the "robust" design would be much greater than that
assumed for the "high performance" design (see Fig. 5.4). Thus, the robust
compensator should be more robust to errors in the augmentation process, while
the high performance compensator should be more finely tuned to the reference
model M(s), and should provide superior performance when the augmented dynamics
are close to those of the reference model.

Figure 5.4. Model for Linear Compensator Design.

The  residual  monitoring  and  subsequent  transition  to  the  robust  compensator 
provides  for  a  limited  fail  safe  quality  if  the  learning  and/or  adaptation  algo¬ 
rithms  fail  to  provide  the  desired  augmentation.  Performance  would  be  excellent 
when  the  learning/adaptation  systems  have  converged. 

Augmentation  Via  Dynamic  Inversion 

Dynamic inversion has been used in the robotics field and in many other
applications for a number of years [Slotine & Li (1991)]. It is a highly
effective scheme when the plant model is well known. A major weakness of this
scheme is the high degree of performance sensitivity to modeling error. This is
an area where adaptation and learning augmentation could provide a significant
performance boost. Adaptive and learning systems (see Fig. 5.5) could jointly
compute an estimate $\hat{f}(x,u)$ of the actual plant dynamics $f(x,u)$. A
control would be selected that directly cancels the nonlinear dynamics (as
characterized by the estimate) and then injects the desired error dynamics. This
is essentially the same hybrid control scheme that was detailed in Section 4.1.
A brief summary of this scheme as it might apply to the attitude rate control
problem is presented below.

Figure 5.5. Augmentation via Dynamic Inversion.


Assume that the plant dynamics and outputs are given in continuous time by

$$\dot{x} = f(x,u)$$
$$y = Cx$$

and the desired tracking error dynamics are specified as

$$\dot{\tilde{y}} + E\tilde{y} = 0$$

where $\tilde{y} = y_d - y$, and $E$ is a diagonal matrix with positive elements
(so the system is stable). If the error dynamics are enforced, the output errors
$\tilde{y}$ will exponentially decay to zero with time constants
$\tau_i = 1/E_{ii}$. The error dynamics may be expressed in terms of the plant
dynamics by differentiating the output equation

$$\dot{y} = C\dot{x} = Cf(x,u)$$

and substituting this into an expression for the derivative of the error

$$\dot{\tilde{y}} = \dot{y}_d - \dot{y} = \dot{y}_d - Cf(x,u)$$

The hybrid adaptive/learning system can be used to generate an estimate
$\hat{f}(x,u)$ of the plant dynamics $f(x,u)$. By selecting $u$ such that

$$\dot{y}_d + E\tilde{y} - C\hat{f}(x,u) = 0$$

the desired error dynamics are obtained if the plant is identified exactly.

The key to this approach is the accurate estimation of the plant dynamics. Also,
caution must be used in selecting the plant inputs and outputs to ensure that the
plant dynamics are minimum phase, since any attempt to invert nonminimum
phase dynamics may lead to instability.
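
The role of model accuracy in dynamic inversion can be illustrated with a scalar sketch. Everything specific here is an assumption made for illustration: the plant $\dot{x} = -x|x| + 2u$ with $y = x$, the gain $E = 2$, the constant command, and the Euler integration are hypothetical and are not the aircraft model or controller used in this work. The control is obtained by solving $\dot{y}_d + E\tilde{y} - C\hat{f}(x,u) = 0$ for $u$, once with an exact estimate and once with a deliberately wrong one.

```python
def plant(x, u):
    """Hypothetical scalar plant dynamics: x_dot = -x|x| + 2u, with y = x."""
    return -x * abs(x) + 2.0 * u


def final_tracking_error(model_scale, gain_e=2.0, y_des=1.0, dt=0.001, t_end=3.0):
    """Simulate dynamic inversion and return the tracking error at t_end.

    The controller's plant estimate is f_hat(x, u) = -model_scale * x|x| + 2u,
    so model_scale = 1.0 corresponds to an exactly identified plant.
    """
    x = 0.0
    for _ in range(int(round(t_end / dt))):
        error = y_des - x
        # Solve  yd_dot + E*error - f_hat(x, u) = 0  for u (yd_dot = 0 here).
        u = (gain_e * error + model_scale * x * abs(x)) / 2.0
        x += dt * plant(x, u)          # Euler step of the true plant
    return y_des - x


exact_error = final_tracking_error(model_scale=1.0)    # plant identified exactly
biased_error = final_tracking_error(model_scale=0.5)   # half of the nonlinearity missed
print(exact_error, biased_error)
```

With the exact estimate the error decays at the rate set by $E$; with the biased estimate a steady tracking error remains, which is the sensitivity that the adaptive and learning components are intended to reduce.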

This scheme is attractive because only an input/output representation of the
dynamics is required; a structured model is not necessary. In addition, this
approach does not require the development of a structured adaptive control or
system identification algorithm. The simple time-delay adaptive augmentation
described in Section 4.1 is a much simpler adaptive algorithm, and may be used to
facilitate learning during its convergence phase and to handle novel and
time-varying dynamics.

5.4.2 Control Methodology Selection

Both the feedforward augmentation and plant augmentation schemes offer some
interesting features. For example, both have some degree of robustness to
learning and adaptation imperfections. Unfortunately, these schemes require
significant development of enabling technologies before they can be applied to
the problem at hand. The dynamic inversion algorithm is a simpler, proven
algorithm that has been demonstrated on aircraft control problems of a smaller
scale (e.g., see Section 4.2). Given the limited scope of the current effort and
the past experience gained with this approach, the learning augmented dynamic
inversion scheme was selected as the control methodology to be used in
conjunction with the 6-DOF CAS design demonstration.

5.5 Controller Design

This section describes the design of the control augmentation system using the
hybrid control scheme detailed in Chapter 4. In review, the objective of the
control system is to track pitch rate and roll rate commands (in stability axes),
while maintaining turn coordinated flight ($\beta = 0$). This section begins with
a description of the top-level architecture, followed by a detailed description
of the a priori, adaptive, and learning components.

5.5.1 Control Architecture


The control system consists of an outer-loop $\beta$ regulator and an inner-loop
angular rate tracker (see Fig. 5.6). The inputs are commanded stability-axis roll
rate, pitch rate, and sideslip angle ($\beta = 0$). This type of structure, with
the attitude rates controlled via dynamic inversion in an inner-loop and the
wind-axis attitudes controlled via dynamic inversion in an outer-loop, has been
used successfully in other studies (e.g., [Bugajski, Enns, & Elgersma (1990)]).
The outer $\beta$ loop is necessary because it makes neither physical nor
mathematical sense to regulate the sideslip angle under direct rudder control.
The predominant forcing term of the $\beta$ dynamics is the stability-axis yaw
rate. In addition, the dynamics from the rudder command to sideslip angle are
nonminimum phase. The consequence of performing dynamic inversion on a
nonminimum phase plant is analogous to attempting to cancel a right half-plane
zero with a controller pole. Thus, stability-axis yaw rate is used as a pseudo
control for the $\beta$ controller, and this rate command along with the
stability-axis roll and pitch rate commands comprise the inputs to the angular
rate controller. The following subsections present the details of the fixed,
adaptive, and learning components of the angular rate and $\beta$ controllers.


Figure 5.6. Top-level Architecture of the Angular Rate CAS.

5.5.2  Angular  Rate  Control 

The angular rate control system (see Fig. 5.7) is composed of reference model
dynamics, error dynamics, and plant inversion functions. The reference model
and error dynamics are performed in stability-axes, while the plant inversion
function is performed relative to the body-axis dynamics. The reference model is
a prefilter that specifies the desired dynamics of the aircraft under "perfect"
control (i.e., if initial tracking errors are zero, there are no disturbances
present, and plant inversion is performed exactly). The inputs to the reference
model are the commanded stability-axis body rates $[p_c, q_c, r_c]^T$ and the
outputs are the reference stability-axis body rates $[p_r, q_r, r_r]^T$. Since
disturbances are always present, and the plant inversion process will not be
perfect, some error may accumulate. The error dynamics specify how the system is
to respond to rate errors. If, for example, the initial tracking error is
significant, it is unreasonable to ask the plant to converge to the reference
input in one control cycle (deadbeat response), since the required control
effort will be excessive. Thus, the error dynamics smooth the tracking
convergence phase over several control cycles. The error dynamics may also be
augmented with integral action to boost the low frequency gain of the system.
The objective of plant inversion is to determine the control vector that drives
the system outputs to their desired values $[p_d, q_d, r_d]^T$ over the next
control cycle. Plant inversion is performed by using a priori information as
well as adaptively gained and learned knowledge of the plant dynamics.


Figure  5.7.  The  Angular  Rate  Control  System. 
Reference Model


The reference model embodies the desired dynamics for the closed-loop system. It
also acts as a lowpass filter for any "non-trackable" high frequency signals
present in the command inputs, so that the reference signals sent to the
controller are physically realizable by the aircraft. Separate
lateral/directional and longitudinal reference models simplify the design while
ensuring consistency between yaw rate and roll rate reference signals. The
reference models were generated by a four-step procedure involving: (i)
linearization of the nonlinear equations-of-motion about a nominal operating
point at H = 5,000 ft and V = 600 ft/s; (ii) design of a linear-quadratic (LQ)
servo controller, which yielded the desired dynamics; (iii) determination of the
closed-loop dynamics of the system under LQ servo control; and (iv) model
reduction and conversion to discrete-time of the closed-loop dynamics. The
resulting reference models represent a compact, linear set of difference
equations of the form

VrMl  ~  ^r{^r^r,k  ) 

where  yr=[Pr>^rri  X  c  9?*  for  the  lateral/directional  model,  and 

y  =  q,,  u  =  q^,  and  for  the  longitudinal  reference  model.  This  procedure 

guarantees  that  the  reference  signals  are  consistent  and  achievable  (at  least  in  a 
linear  sense). 

Error  Dynamics 

The error dynamics models determine how the system will react to initial
tracking errors, imperfect plant inversion, and disturbances. Fast error dynamics
provide  rapid  convergence  to  the  reference  values  at  the  expense  of  a  higher  level 
of  control  effector  activity.  Slow  error  dynamics  provide  sluggish  response,  but 
with  lower  levels  of  control  activity.  The  p,q,r  error  dynamics  consist  of  three 
decoupled  difference  equations  that  provide  both  proportional  and  integral  (PI) 
compensation of the output errors. The error dynamics (see Fig. 5.8) generate the
desired output $y_d$ that is sent to the plant inversion function. The desired
output is given by

$$y_{d,k+1} = y_{r,k+1} - K_P\,\tilde{y}_k - K_I\,z_k$$

where $\tilde{y}_k = y_{r,k} - y_k$ is the output error, $z_k$ is the integral of
the error, and $K_P$ and $K_I$ are the proportional and integral gains. If plant
inversion is perfect (i.e., if $y_{k+1} = y_{d,k+1}$), then the error dynamics
are given by

$$\tilde{y}_{k+1} - K_P\,\tilde{y}_k - K_I\,z_k = 0$$

If plant inversion is imperfect, then the error dynamics become

$$\tilde{y}_{k+1} - K_P\,\tilde{y}_k - K_I\,z_k = \delta_k$$

where $\delta_k$ is an inversion error term; i.e.,
$\delta_k = y_{d,k+1} - y_{k+1}$. The main feature of the integral term is its
ability to drive the output error to zero even in the presence of plant model
bias errors.
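
The effect of the integral term on a constant inversion bias can be checked with a small numerical sketch. It assumes a scalar error recursion of the form e[k+1] = kp*e[k] + ki*z[k] + bias, with z the running sum of the error; the gain names, their values, and the sign convention (in which a stabilizing integral gain is negative) are illustrative assumptions rather than the gains used in this design.

```python
def final_error(kp, ki, bias, steps=200, e0=1.0):
    """Iterate e[k+1] = kp*e[k] + ki*z[k] + bias and return the final error.

    z is the running sum (discrete integral) of the error, including the
    current sample. All values are hypothetical.
    """
    e, z = e0, e0
    for _ in range(steps):
        e = kp * e + ki * z + bias
        z += e
    return e


# Same constant inversion bias, with and without integral action.
proportional_only = final_error(kp=0.5, ki=0.0, bias=0.2)
proportional_integral = final_error(kp=0.5, ki=-0.1, bias=0.2)
print(proportional_only, proportional_integral)
```

Without integral action the error settles at bias/(1 - kp); with integral action the accumulated term cancels the bias and the error converges to zero.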


Figure  5.8.  Proportional  Plus  Integral  Error  Dynamics. 

Plant  Inversion 

The plant inversion process computes the current control $u_k$ as a function of
the current state $x_k$ such that the output at the next time step is equal to
the desired output, $y_{k+1} = y_{d,k+1}$. For the inner rate loop, the control
consists of the four control surface commands, the state vector consists of the
three body rates, airspeed, angle-of-attack, sideslip angle, and the pitch and
roll angles

$$x_k = [\,p \;\; q \;\; r \;\; V \;\; \alpha \;\; \beta \;\; \theta \;\; \phi\,]_k^T$$

and the outputs are the body rates

$$y_k = [\,p \;\; q \;\; r\,]_k^T$$

The control system is composed of an a priori linear component, an adaptive
component, and a learning component. By design, the a priori model of the plant
is a poor representation of the dynamics over the maneuver envelope. This was
done to facilitate evaluation of the hybrid learning/adaptive augmentation.
A Priori Rate Control


The linear a priori control is determined via dynamic inversion of a linear model
of the aircraft. The linear model was obtained by numerically linearizing the
nonlinear plant at a trim condition corresponding to an airspeed of 600 ft/s, an
altitude of 5,000 ft, and an angle-of-attack of 3 deg. This is a poor model over
the spectrum of conditions that will be experienced during the demonstration
maneuver, where airspeed, altitude, and angle-of-attack range from 400-1,100
ft/s, 500-23,000 ft, and 0-20 deg, respectively.

Given the nonlinear system $x_{k+1} = f(x_k, u_k)$ with trim condition
$x^0 = f(x^0, u^0)$, the linear model may be expressed as

$$x_{k+1} - x^0 = \Phi(x_k - x^0) + \Gamma(u_k - u^0) \qquad (5.1)$$

$$y_k - y^0 = C(x_k - x^0) \qquad (5.2)$$

The output dynamics are given by substituting (5.1) into (5.2)

$$y_{k+1} = C\Phi(x_k - x^0) + C\Gamma(u_k - u^0) + y^0 \qquad (5.3)$$

The linear control is determined by setting $y_{k+1} = y_{d,k+1}$ and solving
for $u_k$

$$u_k = u^0 + (C\Gamma)^{+}\left[\,y_{d,k+1} - y^0 - C\Phi(x_k - x^0)\right] \qquad (5.4)$$

where $(\,)^{+}$ denotes the pseudo inverse of the operand. Since $(C\Gamma)$ is
of full rank, a solution always exists and the solution minimizes the Euclidean
norm of the control vector.
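
The minimum-norm property of the pseudo-inverse solution can be seen in the smallest redundant case: one output and two controls. The sketch below is illustrative only; the scalar state, the numerical values of the model, and the function names are assumptions and do not come from the aircraft linearization. For a row vector the pseudo inverse reduces to the transpose divided by the squared norm.

```python
def pinv_row(row):
    """Pseudo inverse of a 1 x m row vector, returned as a list of m gains."""
    norm_sq = sum(g * g for g in row)
    if norm_sq == 0.0:
        raise ValueError("row must be nonzero (full rank)")
    return [g / norm_sq for g in row]


def a_priori_control(x, y_des, x0, u0, y0, c_phi, c_gamma):
    """u = u0 + pinv(C*Gamma) * [y_des - y0 - C*Phi*(x - x0)], scalar x and y."""
    bracket = y_des - y0 - c_phi * (x - x0)
    return [u0_i + g * bracket for u0_i, g in zip(u0, pinv_row(c_gamma))]


def predicted_output(x, u, x0, u0, y0, c_phi, c_gamma):
    """Linear output model: y = C*Phi*(x - x0) + C*Gamma*(u - u0) + y0."""
    control_effect = sum(g * (u_i - u0_i) for g, u_i, u0_i in zip(c_gamma, u, u0))
    return c_phi * (x - x0) + control_effect + y0


# Hypothetical single-output, two-control linear model about a trim point.
model = dict(x0=1.0, u0=[0.0, 0.0], y0=1.0, c_phi=0.9, c_gamma=[0.3, 0.4])

u_min_norm = a_priori_control(x=1.5, y_des=2.0, **model)
# Alternative solution that reaches the same output using the first control only.
u_one_surface = [(2.0 - 1.0 - 0.9 * 0.5) / 0.3, 0.0]

norm_min = sum(v * v for v in u_min_norm) ** 0.5
norm_one = sum(v * v for v in u_one_surface) ** 0.5
print(u_min_norm, norm_min, norm_one)
```

Both control vectors achieve the desired output in the linear model, but the pseudo-inverse solution spreads the effort across the two controls and has the smaller Euclidean norm.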

Adaptive  Augmentation 

Adaptive control is required to compensate for the nonlinear dynamics while the
relatively slow process of learning builds the required input-output map. The
type of adaptation selected is related to time-delay control (TDC) [Youcef-Toumi
& Ito (1990)]. This is a simple scheme that is easy to implement and is
compatible with the formulation of the a priori and learning control components.
Time-delay control estimates an unstructured forcing term in the dynamics by
comparing the predicted output provided by the linear model with the actual
output at the current time step.

The output dynamics are written as the sum of the linear expression of (5.3) plus
an unknown nonlinear correction term $\Psi(x_k, u_k)$

$$y_{k+1} = C\Phi(x_k - x^0) + C\Gamma(u_k - u^0) + y^0 + \Psi(x_k, u_k)$$

The current measurement or estimate of the output can be used to estimate the
past value of the correction term

$$\Psi(x_{k-1}, u_{k-1}) = y_k - C\Phi(x_{k-1} - x^0) - C\Gamma(u_{k-1} - u^0) - y^0$$

Assuming that the variation in the nonlinear correction term is small between
control cycles implies that

$$\Psi(x_k, u_k) \approx \Psi(x_{k-1}, u_{k-1})$$

so that the current estimate is given by

$$\hat{\Psi}(x_k, u_k) = y_k - C\Phi(x_{k-1} - x^0) - C\Gamma(u_{k-1} - u^0) - y^0$$

Thus the a priori plus adaptive control becomes

$$u_k = u^0 + (C\Gamma)^{+}\left[\,y_{d,k+1} - y^0 - C\Phi(x_k - x^0) - \hat{\Psi}_k\right]$$

Because this estimate of $\Psi$ is unstructured (the dependencies of $\Psi$ on
$x_k$ and $u_k$ are not directly observable), this form of adaptive augmentation
does not accommodate errors in the $\Gamma$ matrix or nonlinearities in the
control.

The estimation process is illustrated in Fig. 5.9. Note that the raw estimates
are filtered since noise on the state measurements will propagate directly into
$\hat{\Psi}$, which is a weakness of this adaptive scheme.


Figure 5.9. Estimation of Nonlinear Correction Terms.
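
A scalar sketch of the time-delay estimate may help make the idea concrete. The plant, its coefficients, and the unmodeled term below are hypothetical, the trim values are taken as zero, and the filtering of the raw estimate mentioned above is omitted for brevity. The correction term is estimated from the previous cycle as the difference between the measured output and the linear prediction, and is then subtracted in the inversion.

```python
import math

A, B = 0.8, 1.0   # hypothetical linear model: y[k+1] = A*y[k] + B*u[k]


def unmodeled(y):
    """Hypothetical nonlinear term that the linear model does not capture."""
    return 0.3 * math.sin(y) + 0.1


def final_error(use_tdc, y_des=1.0, steps=60):
    """Track a constant y_des and return the final tracking error."""
    y, y_prev, u_prev = 0.0, None, None
    for _ in range(steps):
        psi_hat = 0.0
        if use_tdc and u_prev is not None:
            # Past value of the correction term: measured minus predicted output.
            psi_hat = y - A * y_prev - B * u_prev
        # Linear inversion with the estimated correction subtracted.
        u = (y_des - A * y - psi_hat) / B
        y_prev, u_prev = y, u
        y = A * y_prev + B * u + unmodeled(y_prev)   # true plant update
    return y_des - y


error_without = final_error(use_tdc=False)
error_with = final_error(use_tdc=True)
print(error_without, error_with)
```

In this example the a priori inversion alone leaves a persistent error equal to the unmodeled term, while the time-delay estimate removes it once the output settles, since the correction term then changes little between control cycles.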

Learning  Augmentation 

While adaptation is used to compensate for modeling error while the learning
system converges, the ultimate goal is to supplant the responsibilities of the
adaptive scheme with the learned map as the map becomes accurate. Unlike the
adaptive control component, the learning component is less susceptible to noise
and is able to compensate for errors in the $\Gamma$ matrix. This section
describes how the learned map is used to generate the learning augmented control
signal. The details on how the learned map is generated are discussed in Section
5.6.

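The network builds a forward map $\Pi(x_k, u_k)$ for the nonlinear compensation
term so that the design model for the control is given by

$$y_{k+1} = C\Phi(x_k - x^0) + C\Gamma(u_k - u^0) + y^0 + \Pi(x_k, u_k)$$

The adaptive term now accounts for output errors in the new design model (a
priori plus learned)

$$\hat{\Psi}_k = y_k - C\Phi(x_{k-1} - x^0) - C\Gamma(u_{k-1} - u^0) - y^0 - \Pi(x_{k-1}, u_{k-1})$$

Since the control objective is $y_{k+1} = y_{d,k+1}$, $u_k$ must be solved from

$$y_{d,k+1} - C\Phi(x_k - x^0) - C\Gamma(u_k - u^0) - y^0 - \Pi(x_k, u_k) - \hat{\Psi}_k = 0 \qquad (5.5)$$

To achieve a closed-form solution for $u_k$ it is necessary to linearize the
network term about the current state and previous control

$$\Pi(x_k, u_k) \approx \Pi(x_k, u_{k-1}) + \left.\frac{\partial \Pi}{\partial u}\right|_{x_k,\,u_{k-1}} (u_k - u_{k-1}) \qquad (5.6)$$

substitute (5.6) into (5.5) and solve for the current control $u_k$

$$u_k = \left[C\Gamma + \frac{\partial \Pi}{\partial u}\right]^{+}\left[\,y_{d,k+1} - y^0 - C\Phi(x_k - x^0) - \hat{\Psi}_k + C\Gamma u^0 - \Pi(x_k, u_{k-1}) + \frac{\partial \Pi}{\partial u}\,u_{k-1}\right]$$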
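
The linearized solve for the learning augmented control can be sketched for a scalar case. The learned map, the model coefficients, and the numerical values below are hypothetical, the trim values are taken as zero, and the Jacobian is obtained here by central differences (an assumption; the learning system may supply it differently). The sketch linearizes the map about the previous control, solves for the new control in closed form, and compares the residual of the full nonlinear design equation before and after the update.

```python
import math


def learned_map(x, u):
    """Hypothetical learned nonlinear compensation term (stands in for the network)."""
    return 0.2 * math.sin(u) + 0.1 * x * x


def jacobian_u(x, u, h=1e-6):
    """Central-difference estimate of the partial derivative of the map w.r.t. u."""
    return (learned_map(x, u + h) - learned_map(x, u - h)) / (2.0 * h)


def residual(u, x, y_des, psi_hat, c_phi, c_gamma):
    """Left-hand side of the nonlinear design equation (zero trim values)."""
    return y_des - c_phi * x - c_gamma * u - learned_map(x, u) - psi_hat


def solve_control(x, u_prev, y_des, psi_hat, c_phi, c_gamma):
    """Closed-form control from the map linearized about (x, u_prev)."""
    jac = jacobian_u(x, u_prev)
    rhs = y_des - c_phi * x - psi_hat - learned_map(x, u_prev) + jac * u_prev
    return rhs / (c_gamma + jac)


params = dict(x=0.5, y_des=1.0, psi_hat=0.05, c_phi=0.9, c_gamma=1.0)
u_previous = 0.3
u_current = solve_control(u_prev=u_previous, **params)

residual_before = residual(u_previous, **params)
residual_after = residual(u_current, **params)
print(u_current, residual_before, residual_after)
```

Because the map is only linearized, the nonlinear residual is not driven exactly to zero in one step, but it is much smaller than at the previous control; any remainder is the kind of output error the adaptive term is left to absorb.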


5.5.3  Sideslip  Control 

The objective of the sideslip controller is to maintain coordinated flight by
regulating $\beta$ to zero. The $\beta$ controller is the outer loop that feeds
values of commanded stability-axis yaw rate $r_c$ to the inner rate loop. Thus,
$r_c$ is the "control" signal of the outer loop. Since the outer-loop dynamics
are naturally slower than those of the angular rate dynamics, it is reasonable to
neglect the inner-loop dynamics when designing the outer loop; i.e., the dynamics
from $r_c$ to $r$ are considered to be high frequency "actuator" dynamics for the
purposes of the $\beta$ controller design. This approach is feasible as long as
the bandwidth of the $\beta$ loop (as dictated by the $\beta$ error dynamics) is
sufficiently lower than that of the inner loop.

The structure of the $\beta$ controller is very similar to the rate control
structure, with the exception that there is no need for a reference model since
the control objective is regulation. The $\beta$ controller does have error
dynamics and plant inversion components that perform the same functions as those
in the inner-loop (see Fig. 5.10). The error dynamics are essentially identical
to those of the rate controllers, and include both proportional and integral
feedback of the error signal. The plant inversion process is also very similar.

Figure 5.10. The β Regulator.


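The expression for the β dynamics is given in body coordinates as [Brumbaugh
(1991)]:

$$\begin{aligned}
\dot\beta ={}& [D\sin\beta + Y\cos\beta - X_T\cos\alpha\sin\beta + Y_T\cos\beta - Z_T\sin\alpha\sin\beta]/(Vm) \\
&+ g[\sin\theta\cos\alpha\sin\beta + \cos\theta\sin\phi\cos\beta - \cos\theta\cos\phi\sin\alpha\sin\beta]/V \\
&+ p\sin\alpha - r\cos\alpha
\end{aligned} \qquad (5.7)$$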

The first term on the right-hand side represents the effect of aerodynamic and
thrust forces on β. It is a poorly known, complex, nonlinear function of the
states. As such, it is very difficult to precompute, and is best accommodated via
the adaptive and learning components of the controller. The second and third
terms, on the other hand, are relatively simple, well-defined, and easy to compute
(assuming that measurements or estimates of the relevant states are available).
The last two terms dominate the expression and are equal, apart from sign, to the
stability-axis yaw rate; i.e., the "pseudo" control of the outer loop:

$$r_s = r\cos\alpha - p\sin\alpha$$

The details of the a priori, adaptive, and learning components of the plant inversion
are provided below.

A Priori Component of β Plant Inversion


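The β equation may be discretized using a simple Euler approximation

$$\beta_{k+1} \approx \beta_k + \Delta t \left\{ \frac{g}{V_k}\left[\sin\theta_k\cos\alpha_k\sin\beta_k + \cos\theta_k\sin\phi_k\cos\beta_k - \cos\theta_k\cos\phi_k\sin\alpha_k\sin\beta_k\right] - r_{s,k} + \Psi_k \right\} \qquad (5.8)$$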
where the effects of the aerodynamic and thrust forces are treated as unknown dynamics
and are lumped into the Ψ term. The objective of plant inversion is to find
the control $r_s$ that achieves the desired value of sideslip at the next control cycle.
The a priori component of the control is determined by solving this equation for
the commanded value of stability-axis yaw rate (and neglecting Ψ),

$$r_{s,k} = \frac{\beta_k - \beta_{d,k+1}}{\Delta t} + \frac{g}{V_k}\left[\sin\theta_k\cos\alpha_k\sin\beta_k + \cos\theta_k\sin\phi_k\cos\beta_k - \cos\theta_k\cos\phi_k\sin\alpha_k\sin\beta_k\right] \qquad (5.9)$$
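
The a priori inversion described above amounts to solving a one-step Euler prediction for the yaw-rate command. A minimal Python sketch is given here, assuming the sign convention of (5.7), in which the yaw-rate terms enter the sideslip dynamics as p sin α − r cos α (the negative of the stability-axis yaw rate). The function names, the value of g (in ft/s²), and all argument names are illustrative assumptions and do not come from the report.

```python
import math


def gravity_term(theta, phi, alpha, beta, V, g=32.174):
    """Gravity contribution to beta-dot in rad/s (second bracket of (5.7)).

    Angles are in radians and V in ft/s; g = 32.174 ft/s^2 is an assumed value.
    """
    return (g / V) * (
        math.sin(theta) * math.cos(alpha) * math.sin(beta)
        + math.cos(theta) * math.sin(phi) * math.cos(beta)
        - math.cos(theta) * math.cos(phi) * math.sin(alpha) * math.sin(beta)
    )


def beta_next(beta, r_s, grav, dt, psi=0.0):
    """One Euler step of the sideslip model; psi lumps the unknown dynamics."""
    return beta + dt * (grav - r_s + psi)


def a_priori_yaw_rate_command(beta, beta_des, grav, dt):
    """Solve the Euler step for r_s with the unknown term psi neglected."""
    return (beta - beta_des) / dt + grav
```

Feeding the command returned by `a_priori_yaw_rate_command` back into `beta_next` (with `psi = 0`) reproduces the desired sideslip, which is all the a priori component promises; any nonzero `psi` is left to the adaptive and learning components.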


Adaptive and Learning Components of β Inversion


Equation (5.8) may be written as

$$y_{k+1} = f(x_k) + b u_k$$

where $y = \beta$, $u = r_s$, and $f(x_k) + b u_k$ represents the a priori component of the β
dynamics (the right-hand side of (5.8)). The network will build a forward map
$n(x_k, u_k)$ for the aerodynamic and thrust forces that are absent from (5.8),

$$n(x_{k-1}, u_{k-1}) \approx y_k - f(x_{k-1}) - b u_{k-1}$$

so that the new design model with learning is

$$y_{k+1} = f(x_k) + b u_k + n(x_k, u_k) \qquad (5.10)$$

The adaptive system will account for output errors in (5.10),

$$\psi_k = y_k - f(x_{k-1}) - b u_{k-1} - n(x_{k-1}, u_{k-1})$$

and the complete design model is given by

$$y_{k+1} = f(x_k) + b u_k + n(x_k, u_k) + \psi_k \qquad (5.11)$$

The control objective is $y_{k+1} = y_{d,k+1}$, so that the control must be solved from (5.11).


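Recall from the inner-loop problem that the network dynamics had to be
linearized to obtain a closed-form solution. A similar procedure must be carried
out here, resulting in

$$u_k = \frac{y_{d,k+1} - f(x_k) - n(x_k, u_{k-1}) + \dfrac{\partial n}{\partial u}\, u_{k-1} - \psi_k}{b + \dfrac{\partial n}{\partial u}}$$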
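
The scalar control selection for the outer loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the network `n` is passed in as an arbitrary callable, its sensitivity to the control is approximated by a central difference (the report does not say how the derivative was obtained), and the names `solve_control`, `du`, and the near-zero guard are hypothetical.

```python
def solve_control(y_des, f_x, b, n, x, u_prev, psi, du=1e-4):
    """Return u_k such that f(x_k) + b*u_k + n(x_k, u_k) + psi_k = y_des,
    with n linearized about the previous control u_{k-1}.

    f_x is the already-evaluated a priori term f(x_k); n is a callable n(x, u).
    """
    n0 = n(x, u_prev)
    # Central-difference estimate of dn/du at (x_k, u_{k-1}) (an assumption).
    dn_du = (n(x, u_prev + du) - n(x, u_prev - du)) / (2.0 * du)
    denom = b + dn_du
    if abs(denom) < 1e-9:
        raise ZeroDivisionError("b + dn/du is near zero; control is ineffective")
    return (y_des - f_x - n0 + dn_du * u_prev - psi) / denom
```

When `n` happens to be affine in `u`, the linearization is exact and the returned control satisfies the design model to within rounding; for a general network the result is only as good as the first-order approximation around `u_prev`.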


A linear-Gaussian network with an incremental gradient learning algorithm
was used to form the basis of the learning system in this example. The input/output
equations for this network are repeated below (see Section 3.5 for more
detail):

$$y(x) = \sum_{i=1}^{n} \Gamma_i(x)\, f_i(x)$$

where x and y are the input and output vectors, respectively, n is the number of
nodes in the network, $f_i(x)$ are the local basis functions, and $\Gamma_i(x)$ are the
normalized influence functions, which are defined to be

$$\Gamma_i(x) = \frac{\gamma_i(x)}{\sum_{j=1}^{n} \gamma_j(x)}, \quad \text{with } 0 < \Gamma_i(x) < 1 \text{ and } \sum_{i=1}^{n} \Gamma_i(x) = 1$$


In the case of a linear-Gaussian network, the functions $f_i(x)$ and $\gamma_i(x)$ become

$$f_i(x) = M_i (x - x^i) + b_i$$
$$\gamma_i(x) = c_i \exp\{-(x - x^i)^T Q_i (x - x^i)\}$$

For the work presented here, the matrices $M_i$ and the vectors $b_i$ were adjustable,
but the matrices $Q_i$ (each $Q_i$ must be symmetric positive definite), the vectors $x^i$,
and the scalars $c_i$ were all held constant. As shown in Fig. 5.11, a total of n = 27
linear-Gaussian node pairs were used in this network. ^ The network had eight
inputs covering Mach number, angle-of-attack, sideslip, dynamic pressure, as
well as aileron, horizontal stabilator, differential stabilator, and rudder inputs.
The four outputs of the network represent learned (but initially unmodeled) dynamics
in roll rate, pitch rate, yaw rate, and sideslip as a function of the eight inputs.
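
The forward pass and an incremental gradient update for such a network can be sketched as follows. This is a minimal pure-Python illustration, not the NetSim implementation: it assumes diagonal $Q_i$ matrices, a single shared scalar c, and a plain squared-error gradient step on $M_i$ and $b_i$ only; the class and method names are hypothetical.

```python
import math


class LinearGaussianNetwork:
    """Sum of local linear models blended by normalized Gaussian influences."""

    def __init__(self, centers, q_diag, n_out, c=1.0):
        self.centers = centers  # fixed node centers x^i
        self.q_diag = q_diag    # fixed positive diagonals of Q_i (assumption)
        self.c = c              # fixed scalar, shared by all nodes (assumption)
        n_in = len(centers[0])
        self.M = [[[0.0] * n_in for _ in range(n_out)] for _ in centers]
        self.b = [[0.0] * n_out for _ in centers]

    def influences(self, x):
        """Normalized influence of every node at x (sums to one)."""
        gam = [self.c * math.exp(-sum(q * (xj - cj) ** 2
                                      for q, xj, cj in zip(qd, x, xc)))
               for xc, qd in zip(self.centers, self.q_diag)]
        total = sum(gam) or 1e-300  # guard against underflow far from all nodes
        return [g / total for g in gam]

    def _local(self, i, dx):
        """Local linear model f_i evaluated at offset dx = x - x^i."""
        return [sum(m * d for m, d in zip(row, dx)) + bo
                for row, bo in zip(self.M[i], self.b[i])]

    def predict(self, x):
        y = [0.0] * len(self.b[0])
        for i, G in enumerate(self.influences(x)):
            dx = [xj - cj for xj, cj in zip(x, self.centers[i])]
            for o, f in enumerate(self._local(i, dx)):
                y[o] += G * f
        return y

    def train_step(self, x, target, rate=0.5):
        """One incremental gradient step; returns the pre-update error."""
        err = [t - y for t, y in zip(target, self.predict(x))]
        for i, G in enumerate(self.influences(x)):
            dx = [xj - cj for xj, cj in zip(x, self.centers[i])]
            for o, e in enumerate(err):
                self.b[i][o] += rate * G * e
                for j, d in enumerate(dx):
                    self.M[i][o][j] += rate * G * e * d
        return err
```

Because each update is weighted by the node's normalized influence, a training sample mainly changes the nodes centered near it, which is the spatial localization the hybrid scheme relies on to retain what was learned elsewhere in the input space.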


^ Although Fig. 5.11 seems to indicate otherwise, the linear-Gaussian nodes in the network are
really arranged in a single layer (which has been folded to make the figure more compact).


Figure  5.11.  Linear-Gaussian  Network  Used  in  the  Hybrid  Controller. ^ 


^ This network has eight inputs (Mach number, angle-of-attack, sideslip, dynamic pressure,
aileron, horizontal stabilator, differential stabilator, and rudder), 27 nodes arranged in a
single layer (but drawn in a more compact form), and four outputs (contained in the
vector-valued signal "Net"). The network outputs represent the learned contributions to the
prediction of the expected next values of aircraft roll rate, pitch rate, yaw rate, and sideslip,
in terms of the current network inputs.


5.6  Simulation  Results 


Experimental results obtained from a software simulation of the hybrid
adaptive/learning control methodology, applied to the attitude rate control problem on a
nonlinear aircraft model and demonstrated relative to the S-trajectory maneuver,
are summarized in this section. The basic experimental setup is shown below in
Fig. 5.12, which is a snapshot of the NetSim project window. The key software
modules are listed in Table 5.2.


Figure 5.12. A Snapshot of the NetSim Project Window Used in the 6-DOF Flight
Control Demonstration.

Table 5.2. Main NetSim Component Modules for S-Trajectory Demonstration.

Module             Description
A/C(6)             6-DOF, nonlinear aircraft model
Actuators(6)       actuator suite for 6-DOF aircraft
"S" Trajectory     open-loop guidance command generator
RateRef            performance (reference) model w/ error dynamics
RateCtrl           attitude rate controller
Beta Controller    sideslip controller
S-Traj Net         linear-Gaussian network used for learning
Gust Model         Dryden model wind gust generator

5.6.1 S-Trajectory Maneuver

The S-trajectory maneuver used to demonstrate the hybrid adaptive/learning control
system in a coupled, multi-axis flight control scenario is outlined below, in
Table 5.3. This maneuver is similar to one described in [Stevens & Lewis (1992)].
Starting from a wings-level, trimmed flight condition at an altitude of 5,000 ft and
an airspeed of 987 ft/s (Mach 0.9), guidance commands are issued in an open-loop
fashion. As outlined in Section 5.5, the overall control augmentation system consists
of an outer-loop sideslip regulator and an inner-loop angular rate tracker.
Thus, there are two explicit exogenous inputs to the control augmentation system,
$p_{s,com}$ and $q_{s,com}$, which are specified by the guidance command generator;
additionally, β = 0 is assumed to be an implicit command.


Table 5.3. S-Trajectory (Open-Loop) Guidance Commands.

time (s)    guidance command
 0          accelerate forward; hold throttle at full afterburner
 5          initiate pitch pull-up (10 deg/s)
21          initiate roll right about stability-axis (60 deg/s)
24          terminate roll right
40          terminate pull-up
41          initiate roll left about stability-axis (60 deg/s)
44          terminate roll left
60          terminate maneuver

The S-trajectory is a difficult and challenging maneuver for several reasons
(particularly given the capabilities of the specific aircraft model and actuator suite
used). The overall maneuver takes the aircraft through a variety of different
flight regimes during the course of its execution. A "side view" (altitude vs.
ground track) of the maneuver is shown in Fig. 5.13; this perspective clearly illustrates
why the maneuver is referred to as the "S-trajectory." Note that in Fig.
5.13, as well as in all subsequent figures relating to the S-trajectory, the results
shown are for the hybrid attitude rate control system, with adaptive and learning
augmentation, after learning has occurred.

Figs. 5.14-5.18 provide additional perspectives that are useful for characterizing
the S-trajectory maneuver: Fig. 5.14 shows the Euler angles associated with the
maneuver; Fig. 5.15 is a plot of altitude vs. airspeed; Fig. 5.16 shows angle-of-attack
and sideslip as a function of time; Fig. 5.17 shows load factor vs. time; and
finally, Fig. 5.18 shows dynamic pressure vs. time.

It should be clear from these plots that the S-trajectory is a complex maneuver.
For instance, altitude ranges from 5,000 to 22,000 ft, while airspeed ranges from
370 ft/s to 1070 ft/s. Over this spectrum, the dynamic pressure (which is a strong
determinant of the effectiveness of the aerodynamic control surfaces) that the aircraft
experiences varies from 80 lb/ft² to 1160 lb/ft², with the low point coming
near the most difficult point in the maneuver (around t = 40 s, when the pitch
pull-up is terminated, and the second roll is initiated). During this part of the
maneuver, the angle-of-attack plummets from around 20 deg to -6 deg and load
factor goes negative. In fact, both the angle-of-attack and load factor are negative
during the second 180 deg roll.


[Plot: Hybrid: Euler Angles]

[Plot: Hybrid: Altitude vs. Airspeed]

Figure 5.15. S-Trajectory Using the Hybrid Controller: Altitude vs. Airspeed.

[Plot: Hybrid: Angle-of-Attack & Sideslip; horizontal axis: Time (sec)]

[Plot: Hybrid: Load Factor]

[Plot: Hybrid: Dynamic Pressure]

Figure 5.18. S-Trajectory Maneuver Using the Hybrid Controller: Dynamic Pressure vs. Time.


The hybrid adaptive/learning control system was trained by exposing it to approximately
4,300 instances of the S-trajectory maneuver. During roughly half of
these trials, Dryden model wind gusts [MIL-STD-1797A (1990)] were used to generate
disturbances to the otherwise deterministic vehicle dynamics. Although the
performance of the learning system in this case (i.e., its ability to accurately synthesize
the desired unmodeled dynamics in an efficient manner, given the available
resources and experiential data) was adequate for the purposes of this investigation,
we believe that much more efficient methods are possible.^

Initially, before any learning has occurred, the performance of the hybrid control
system is identical to that of the control system with adaptive augmentation only.
In this case, the adaptive control system acting on its own is able to complete the
maneuver, but tracking performance is not very good and, moreover, repeated
trials do not make the adaptive controller perform better. With additional experience,
the hybrid controller is able to perform better than the adaptive system, due
to the incorporation of learning. Note that the a priori control system alone (without
adaptive or learning augmentation) is unable to control the vehicle well
enough for the maneuver to be completed. In each case, the only a priori model
information used to design the controllers was a single, low-order linearization of
the actual nonlinear aircraft dynamics at a trimmed flight condition corresponding
to an altitude of 5,000 ft and an airspeed of 600 ft/s, together with the rigid-body
dynamics that appear in (5.8).

Fig. 5.19 shows the tracking performance of the hybrid controller (after learning
has occurred) for the stability-axis pitch rate command. In this figure (as well as
the next two), three curves are plotted: a command signal, the corresponding signal
output from the reference model dynamics, and the actual response of the
nonlinear vehicle under hybrid control. Perfect tracking would result if the actual
signal matched that output from the reference model. Fig. 5.20 shows the
stability-axis roll rate tracking performance, and Fig. 5.21 shows the stability-axis
yaw rate tracking performance.


^ For example, variable structure learning methods could have been employed. Some potential
improvements that might be made to the learning system are discussed in Section 6.2.

In Fig. 5.21, the "command" signal is not one of the two exogenous inputs
provided by the open-loop command generator, and instead is generated by the
outer-loop sideslip controller. Note that the sideslip controller was designed
under the assumption that the inner-loop attitude rate tracker was perfect in the
sense that it had no error and no lag. Of course, the actual inner-loop controller
is not perfect, and so this "design separation" assumption is violated. As a result,
the performance of the hybrid controller with respect to yaw rate tracking is not as
good as it is for roll and pitch rate tracking.

In point of fact, yaw rate tracking (about the stability-axis) was not an explicit goal
of the control augmentation system; instead, it was a means for achieving the explicit
goal of sideslip regulation. Thus, the actual tracking performance of the
hybrid control system should be judged in terms of Figs. 5.16, 5.19, and 5.20. In
Fig. 5.16, the sideslip command should be taken to be identically zero throughout
the course of the maneuver.

It should be clear from Figs. 5.16, 5.19, and 5.20 that the tracking performance of
the hybrid adaptive/learning controller is excellent, especially given the limited a priori
model information available to it and the difficulty of the S-trajectory maneuver.
A direct comparison of the performance of the adaptive and hybrid control systems
relative to a near-ideal controller will be presented later in this section.

[Plot: Hybrid: Pitch Rate Tracking; horizontal axis: Time (sec)]

[Plot: Hybrid: Roll Rate Tracking]

Figure 5.20. S-Trajectory Maneuver Under Hybrid Control: Stability-Axis Roll Rate Tracking

[Plot: Hybrid: Yaw Rate Tracking]

Figure 5.21. S-Trajectory Maneuver Under Hybrid Control: Stability-Axis Yaw Rate Tracking


To execute the S-trajectory maneuver, four different aerodynamic control surface
effectors were used by the hybrid control system: differential aileron, symmetric
horizontal stabilator, differential stabilator, and rudder. Figs. 5.22 through 5.25
show the control effector usage for these surfaces, respectively, during this maneuver.

In each case, the control signals were subject to the position and rate saturation
limits of the actuators (as discussed in Section 5.2). As can be readily seen, only
the ailerons saturated (around t = 42 s, during the second roll). As mentioned
previously, the effectiveness of the aerodynamic control surfaces changes radically
over the course of the S-trajectory maneuver. In terms of the signals shown
in these figures, one can observe this effect by noticing that the magnitude of the
control signals required to perform the maneuver is greatest when the dynamic
pressure is lowest (roughly from t = 35 to 45 s).


[Plot: Hybrid: Aileron]

Figure 5.22. S-Trajectory Maneuver Under Hybrid Control: Aileron Response

[Plot: Hybrid: Symmetric Stabilator]

Figure 5.23. S-Trajectory Maneuver Under Hybrid Control: Symmetric Stabilator Response

[Plot: Hybrid: Differential Stabilator]

Figure 5.24. S-Trajectory Maneuver Under Hybrid Control: Differential Stabilator Response


The three controllers developed and applied to this problem (linear, adaptive, and
hybrid) were all derived using the same a priori model information (i.e., a single,
constant parameter, low-order linearization of the actual nonlinear aircraft dynamics
without actuator or engine dynamics, together with some knowledge of
the rigid-body dynamics, not aerodynamics, relating sideslip and yaw rate). A
linear compensator was designed using only this a priori information. This linear
compensator performed so poorly that it could not be used to perform the
S-trajectory maneuver (the vehicle would depart from controlled flight).

The adaptive controller was developed as an extension of the linear compensator,
by using adaptive augmentation to provide an improved on-line model of the nonlinear
aircraft dynamics. A simple adaptive method was incorporated that allowed
the vehicle to complete the S-trajectory maneuver, albeit with substantial
tracking errors. Similarly, the hybrid controller was developed as an extension to
the adaptive controller by using learning augmentation to provide an even better
on-line model of the actual nonlinear aircraft dynamics. Prior to learning, the
hybrid controller performed identically to the adaptive controller; after a small
period of training, the hybrid system was able to perform exceptionally well.

The only real difference between these three controllers was in their ability to
identify and predict the unmodeled dynamics of the aircraft. In particular, all
three employed the same overall control system architecture, and the same on-line
control selection scheme. The linear compensator was of a fixed design, and
could not improve its model on-line. Both the adaptive and learning augmented
controllers were able to update their models on-line. Thus, the most direct way to
compare the performance of the controllers is to examine their ability to predict
the behavior of the actual nonlinear aircraft in terms of the four outputs of interest:
roll rate, pitch rate, yaw rate, and sideslip. Since the linear compensator was
unable to execute the maneuver, it was excluded from the comparison. In addition,
so as to gauge the relative performance that might be obtained under conditions
of near perfect learning, an "ideal" hybrid controller was constructed by replacing
the learning system network with modules derived from the actual nonlinear
aircraft dynamics. Subsequently, the root-mean-square (RMS) values of the
prediction errors for the four outputs were computed over the S-trajectory. These
results are summarized in the bar chart shown in Fig. 5.26. Note that the prediction
errors in the case of "ideal" augmentation are not quite zero due to the presence
of some effects (e.g., actuator position and rate saturation) which could not
easily be accounted for and were hence ignored.
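
The comparison metric described above is straightforward to compute. The short Python sketch below shows one way to form per-channel RMS prediction errors from logged time histories; the function names and the dictionary layout of the logs are assumptions made for illustration, not the report's data format.

```python
import math


def rms(errors):
    """Root-mean-square value of a sequence of error samples."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def rms_prediction_errors(actual, predicted):
    """Per-channel RMS of (actual - predicted).

    Both arguments map a channel name (e.g. "roll rate") to a list of
    samples taken at the same time instants over the maneuver.
    """
    return {ch: rms([a - p for a, p in zip(actual[ch], predicted[ch])])
            for ch in actual}
```

Running this once per controller (adaptive, hybrid, "ideal") on the same maneuver yields the kind of numbers that a bar chart of prediction errors would display.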


[Bar chart: RMS Value of Prediction Errors, by controller: ADAPTIVE, HYBRID, "IDEAL"]

Figure 5.26. Summary of Prediction Errors Using Adaptive, Hybrid
Adaptive/Learning, and "Ideal" Augmentation.

Fig. 5.26 clearly shows the improvement that is possible through the use of learning
augmentation. The hybrid controller easily outperforms the adaptive system,
and is even able to outperform the "ideal" case relative to the sideslip dynamics.


6  Conclusion 


Results obtained during this research program clearly demonstrate many of the
potential benefits of learning augmented control. At the same time, however, issues
uncovered by this investigation also suggest that further work is needed. A
summary of the program is provided below in Section 6.1, while topics for future
research and development are discussed in Section 6.2. Note that each attachment
also includes its own separate set of conclusions and recommendations for
future research.

6.1 Summary

This multiphase research program had the broad aim of investigating the application
of learning systems to automatic control in general, and to flight control in
particular. The first phase analyzed the original drive-reinforcement learning
paradigm and examined its application to automatic control, with mixed results.
It was shown that while the original algorithm showed promise, it nevertheless
lacked the ability to function (alone) as a learning controller. The second phase
compared a number of alternative control strategies including conventional linear
control, adaptive control, as well as other reinforcement learning control
methods (e.g., those developed by Barto, Sutton, et al.). No candidate was found to
dominate the field, and none was perceived to be suitable for application to flight
control. During this same period, a new hybrid adaptive/learning control scheme
was conceived. Subsequently, in the third phase, the hybrid control approach was
more fully developed and applied to several nonlinear dynamical systems, including
a cart-pole system, aeroelastic oscillator, and a three-degree-of-freedom high
performance aircraft. Each application was successful. The fourth phase revisited
drive-reinforcement learning from the point of view of optimal control and
successfully applied a version embedded in the associative control process architecture
to regulate an aeroelastic oscillator. Analysis of this and other similar reinforcement
learning approaches indicated that they are best suited to problems
in optimal control. The fifth phase examined the problem of learning augmented
estimation, and resulted in the development of a preliminary estimation scheme
that is consistent with the hybrid adaptive/learning control approach. In the
sixth and final phase, the hybrid control methodology was applied to a nonlinear,
six-degree-of-freedom flight control problem, and then successfully demonstrated
via a challenging, multiaxis "S-trajectory" maneuver.

Throughout this work, a key concept underlying our approach is the view that
"learning" can be interpreted as the automatic synthesis of multivariable functional
mappings, based on experiential information that is gained incrementally
over time. When combined with adaptation, the resulting hybrid control strategy
provides a sophisticated control system design and implementation technique.
Moreover, our work emphasizes real-time (on-line) adaptation and learning, and
considers the overall problem to have four fundamental elements: adaptation,
learning, control, and estimation. For flight control applications, advanced control
systems incorporating learning might be used advantageously to:

•  facilitate the control system design and tuning process

•  accommodate initially unmodeled dynamics

•  improve performance through on-line self-optimization

•  improve control usage and efficiency relative to purely adaptive approaches

The bottom line is that learning augmentation is beneficial to automatic control in
general and to flight control in particular.

In  conclusion,  the  main  accomplishments  of  this  program  include: 

•  analysis  of  the  D-R  learning  paradigm  and  ACP  network  architecture  in  the 
context  of  control  and  optimal  control,  respectively 

•  motivation  for  and  identification  of  issues  underlying  the  application  of 
learning  to  flight  control 

•  conception and development of the hybrid adaptive/learning control and
estimation methodology

•  application of a variety of learning systems to the control of many different
nonlinear dynamical systems (e.g., cart-pole system on split-level track,
aeroelastic oscillator, and 3-DOF high performance aircraft)

•  development  of  a  6-DOF  learning  augmented  flight  control  system  and 
demonstration  via  a  multi-axis  maneuver 

•  guidance,  supervision,  and  support  of  3  graduate  student  theses 

•  production  of  11  technical  publications 


6.2 Recommendations for Future Work


Although  significant  progress  was  made  during  this  research  program,  it  is 
clear  that  further  work  is  needed.  Key  topics  for  future  research  and  development 
are  summarized  below. 

6.2.1  Reinforcement  Learning  Systems 

The current state of scientific and engineering advancement is such that most
reinforcement learning methodologies cannot readily be applied (in a practical
sense) to those complex control problems which might warrant such approaches.
Even so, the potential benefits associated with a practical reinforcement learning
system implementation are significant and not easily overlooked. Moreover,
many researchers believe firmly in the existence of such implementations. Thus,
we suggest two topics for future research and development in this area.

Reinforcement  Learning  Applications 

A closer examination of the many connections between reinforcement learning
and "classical" approaches to solving optimal control and multiplayer game problems
is needed. The formulation of many such problems fits the basic scenario of
reinforcement learning, with the proviso that reinforcement learning methods
are only appropriate for problems in which there is a significant level of uncertainty
regarding the plant (or player) dynamics.^ It is perhaps also true that
those who have been approaching such problems from a classical engineering
point of view can benefit from the perspective of more biologically motivated researchers,
and vice versa.

Continuous  Input  /  Output  Reinforcement  Learning  Systems 

One  severe  limitation  of  many  current  reinforcement  learning  methods  is  the 
need  to  discretize  both  the  problem  state  space  and  the  control  action  space.  Often, 

^ For example, one might attempt to design robust control systems (off-line) using a two-player
game scenario, in which one player (the protagonist) attempts to minimize some cost function
by selecting an appropriate control law, while the other player (the antagonist) attempts to
maximize the same cost function by modifying plant and/or environmental parameters.
only a binary (i.e., bang-bang) control action set is used. Moreover, it appears that
there may be great difficulty in scaling such implementations up to the point
where a quantized system could be effectively used. An important first step towards
making reinforcement learning methods more useful to control theorists
and engineers would be to develop systems capable of continuous input/output
spaces (i.e., continuous state and action spaces). In addition, variable structure
learning (see below) is a feature that almost certainly should be incorporated into
reinforcement learning systems.


There are several opportunities for enhancing both the learning system and
training process in a hybrid adaptive/learning control system. Generally, these
upgrades are aimed at: (i) making the learning process more efficient, as well as
(ii) automating the learning system design parameter selection process, which
is currently done manually.

Variable  Learning  Rates 

When on-line learning is used in control applications, the system state may remain
in particular regions of its state-space for extended periods of time during
training. Under these conditions, the approximation error should not be expected
to tend uniformly to zero over the input-space. Instead, the error will be lowest in
those areas where the greatest amount of experience has been obtained. This
condition leads to conflicting constraints on the learning rate: it should be small
(to filter the effects of noise) in those regions where the approximation error is
small, but at the same time, it should be large (for fast learning) in those regions
where the approximation error is large (relative to the ambient noise level). Resolution
of this conflict is possible through the use of spatially localized learning
rates, where individual learning rate coefficients are maintained for each
(spatially localized) region and updated in response to the local learning conditions.
Some preliminary work has been performed in this area, e.g., [Jacobs
(1988); Berger (1992)].
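
One simple way to maintain an individual learning rate per region is sketched below in Python. It is only an illustration in the spirit of the delta-bar-delta rule of Jacobs (1988): the rate grows additively while successive local gradients agree in sign and shrinks multiplicatively when they disagree. The class name, the default constants, and the rate bounds are assumptions chosen for the example, not values from the report.

```python
class LocalRate:
    """Adaptive learning rate for one spatially localized region (node)."""

    def __init__(self, rate=0.1, kappa=0.01, phi=0.5, theta=0.7,
                 lo=1e-3, hi=1.0):
        self.rate = rate      # current learning rate for this region
        self.kappa = kappa    # additive increase when gradients agree
        self.phi = phi        # fractional decrease when gradients disagree
        self.theta = theta    # memory of the running gradient average
        self.lo, self.hi = lo, hi
        self.avg = 0.0        # exponential average of past local gradients

    def update(self, grad):
        """Adapt the rate from the newest local gradient and return it."""
        agreement = self.avg * grad
        if agreement > 0.0:
            self.rate = min(self.hi, self.rate + self.kappa)
        elif agreement < 0.0:
            self.rate = max(self.lo, self.rate * (1.0 - self.phi))
        self.avg = (1.0 - self.theta) * grad + self.theta * self.avg
        return self.rate


# One rate object per node; a node's "gradient" could be its
# influence-weighted output error at the current sample.
rates = [LocalRate() for _ in range(3)]
```

A region that sees persistent error of one sign (systematic mismatch) speeds up, while a region whose error merely flips sign from sample to sample (noise around a good fit) slows down, which matches the two competing requirements described above.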




Variable  Structure  Algorithms 


One potential criticism of the use of learning systems in control applications is
that learning may proceed too slowly, so that too much training time is required
before the benefits can be realized. To a large extent, slow learning rates are an
inherent attribute of large distributed networks, since much experiential information
is needed to determine the appropriate values for the large number of adjustable
parameters. One fundamental approach for achieving more rapid learning
is to begin with a small network, having a few adjustable parameters, and incrementally
"grow" the network by enlarging its structure. Although such a network
initially lacks high representational power, it will be able to quickly capture
the main features of the desired mapping. Additional parameters (structure)
may then be added to gradually improve the precision of the mapping. Given appropriate
logic for adding nodes, this approach should display a high overall
learning rate, as the learning process will proceed in a more efficient manner
than when using standard gradient-based training algorithms with fixed network
structures. Until recently, such "variable structure" algorithms were generally
incompatible with the incremental training requirements of on-line control
system applications. These issues have recently been addressed to some extent in
[Cerrato (1993)].

Learning for Flight Control

Four potential applications of learning to flight control are briefly outlined in this
section. It is perceived that the use of learning in these instances might prove to
be advantageous, particularly if a learning system were also employed to fulfill
the main learning augmented control function discussed in previous chapters.

Learning  Augmented  Estimation 

Successful state estimation typically requires an accurate model of the system.
Obviously, learning can be used to facilitate the estimation process if this process
is allowed to utilize the modeling information provided by a learning system. The
use of learning to provide additional model information should prove to be more
robust to measurement noise than if adaptation alone were used. Once learning
has occurred, network evaluations provide stored, time-averaged (and more noise
robust) modeling information, whereas adaptation (dealing only with recent temporal
sequences) can only gain robustness to noise by filtering these signals
(which inherently introduces lag into the estimation process). The material presented
in Section 4.3 only represents a first step in this direction; there is much
room for further analysis and development.

Learning  an  Inverse  Model 

Recall that accurate output tracking in the basic hybrid control approach requires
solving a system of algebraic equations for the control variables to obtain the desired
output at the subsequent time-step. When the plant model is linear, this solution
involves inverting a constant matrix that represents input/output control
effectiveness. However, in the general nonlinear case, the tracking problem requires
solving a nonlinear system of algebraic equations (as the controls no longer
enter linearly). In fact, this nonlinear problem may have many solutions, or no
solution at all, depending on the nonlinearities involved. Once a nonlinear tracking
equation had been solved at a given flight condition, it might be desirable to retain
this solution. Generation of the solution could then be expedited in the future
if a flight condition in the same vicinity were encountered. In fact, it would be inefficient
to numerically solve the same nonlinear problem twice, especially if
multiple iterations were required. The generalization implied by these ideas becomes
tantamount to learning an inverse mapping. Such an approach is generally
possible if and only if the solution to the nonlinear tracking problem is
unique.
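
The notion of retaining a solved tracking equation for later reuse can be illustrated with a small sketch. Everything named here is an assumption made for illustration: the scalar plant map example_plant (chosen to be monotonic in the control so that the solution is unique), the secant iteration, and the coarse grid used to decide when two flight conditions are in the same vicinity. A stored table of this kind is only a crude stand-in for a learned inverse mapping.

```python
# Hypothetical sketch: retain solutions of a scalar nonlinear tracking equation.
# The plant map, grid resolution, and tolerances are illustrative assumptions.

def example_plant(x, u):
    # Assumed next-step output; monotonic in u, so the tracking solution is unique.
    return 0.9 * x + u + 0.2 * u ** 3

def solve_tracking(plant, x, y_des, u0=0.0, tol=1e-9, max_iter=50):
    """Secant iteration for u with plant(x, u) == y_des. Returns (u, iterations)."""
    u_prev, u = u0, u0 + 0.1
    r_prev = plant(x, u_prev) - y_des
    for k in range(1, max_iter + 1):
        r = plant(x, u) - y_des
        if abs(r) < tol:
            return u, k
        if r == r_prev:
            break
        u_prev, u, r_prev = u, u - r * (u - u_prev) / (r - r_prev), r
    raise RuntimeError("no solution found; the tracking problem may be ill-posed")

class InverseMemory:
    def __init__(self, plant, resolution=0.1):
        self.plant, self.resolution, self.store = plant, resolution, {}

    def key(self, x, y_des):
        # Conditions that fall in the same grid cell are treated as "in the same vicinity".
        return (round(x / self.resolution), round(y_des / self.resolution))

    def control(self, x, y_des):
        k = self.key(x, y_des)
        guess = self.store.get(k, 0.0)   # start from a stored nearby solution if one exists
        u, iters = solve_tracking(self.plant, x, y_des, u0=guess)
        self.store[k] = u
        return u, iters
```

Calling control twice at the same condition should need fewer secant iterations the second time, since the stored solution serves as the starting point.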

Learning  the  Trim  Manifold 

Inaccurate knowledge of the trim manifold can be interpreted as a form of model
error. Adding integrators increases controller robustness to such modeling error,
but inherently results in slower rates of convergence (since a finite time is required
for the integral states to have an impact on the control signal). By using
learning to provide the autopilot with a more accurate trim description (in a feedforward
sense), less integral compensation may be required, thereby improving
the convergence rate and ultimately resulting in a better autopilot design. Similarly,
the state estimation process can benefit from a more accurate characterization
of the trim manifold.

Optimizing the Reference System


In model reference control architectures, the controller seeks optimal tracking
performance. Note that even if the control law perfectly achieves this tracking objective,
one can at best only expect to match the performance of the chosen reference
system; thus, the reference model effectively represents an "upper-bound"
on the closed-loop system performance. In such a control paradigm, global system
performance must be addressed through the design of the reference system.
Generally, one would like to construct the best possible reference system that is
consistent with the actual control capabilities of the system. Furnishing an overambitious
reference model might introduce instability, whereas selecting a conservative
reference system will result in suboptimal closed-loop performance. For
general nonlinear systems, performance will be regime dependent: the system
may be capable of better performance in some regions of the operating envelope
than in others. Typically, such variations are not accurately known a priori, so
that a conservative design may result. Through on-line experience with the actual
system, the controller can learn to identify troublesome operating regimes
and relax the reference model expectations accordingly. Likewise, the reference
response can be made more ambitious in those regimes where it is appropriate to
do so.

ABSTRACT


Connectionist  learning  systems  may  be  considered  to  be  automatic  function 
approximation  systems  which  learn  from  examples,  and  have  received  an  increase  in 
interest  in  recent  years.  They  have  been  found  useful  for  a  number  of  tasks,  including 
control  of  multi-dimensional,  nonlinear,  or  poorly  modeled  systems.  A  number  of 
approaches  have  been  applied  to  control  problems,  such  as  modeling  inverse  dynamics, 
backpropagating  error  through  time,  reinforcement  learning,  and  dynamic  programming 
based  algorithms.  The  question  of  integrating  partial  a  priori  knowledge  into  these 
systems  has  often  been  a  peripheral  issue. 

Control  systems  for  nonlinear  plants  have  been  explored  extensively,  especially 
approaches  based  on  gain  scheduling  or  adaptive  control.  Gain  scheduling  is  the  most 
commonly  used  in  practice,  but  often  requires  extensive  modeling  or  manual  tuning,  and  is 
susceptible  to  modeling  uncertainty  and  time-varying  dynamics.  Adaptive  control 
addresses these problems, but usually cannot react to known spatial dependencies
(nonlinearities) quickly enough to compete with a well-designed gain scheduled system.

This  thesis  explores  a  hybrid  control  approach  that  uses  a  connectionist  learning 
system to store spatial dependencies discovered by an indirect adaptive controller. The
connectionist system learns to anticipate the parameters estimated by the indirect adaptive
controller, effectively becoming a gain scheduled controller. The combined system is then
able to exhibit some of the advantages of gain scheduled and adaptive control, without the
extensive manual tuning required by traditional methods. Subsequently, a technique is
presented for making use of input/output partial derivative information from the network.
Finally, the applicability of second-order learning methods to control is considered, and
areas  of  future  research  are  suggested. 

1 INTRODUCTION


1.1 MOTIVATION

The design of effective automatic control systems for nonlinear plants presents a
difficult problem. Because direct analytic solutions to such problems are generally
unobtainable, various approximate solution methods must be used (e.g., gain scheduling).
The design problem is further complicated by modeling errors. If there are significant plant
dynamics that are not included in the design model, or if the plant dynamics change
unpredictably in time, then the closed-loop system can perform worse than expected and
may even be unstable. Furthermore, if the sensors are noisy, then filters will be required,
which tend to make the control system slow to recognize changes in the plant (from either
unmodeled or time-varying dynamics).

Traditional  gain  scheduled  controllers  often  require  extensive  manual  tuning  to 
design and develop, and do not deal well with unmodeled spatial dependencies,
disturbances, or time-varying plants. Adaptive controllers can handle these difficulties in
principle,  but  in  practice  may  adapt  to  spatial  dependencies  so  slowly  that  the  controller  is 
not  as  good  as  a  gain  scheduled  controller  would  be. 

In contrast, an "intelligent" controller operating in a complex environment should be
able to accommodate a certain degree of uncertainty (e.g., from time-varying dynamics,
noise, and disturbances). More importantly, it should be able to learn from experience to
anticipate previously unknown, yet predictable, effects (e.g., quasi-static nonlinearities).
A possible solution to this problem might be a hybrid adaptive / learning control system
which could both adapt to disturbances and learn to anticipate spatial nonlinearities.

1.2 PROBLEM DESCRIPTION

Traditional adaptive control tends to be inefficient and to perform poorly with respect
to significant, unmodeled spatial dependencies, while traditional gain scheduled control has
difficulty with poorly modeled dynamics. The problem is to find a system that can control
a plant in the presence of both simultaneously, while incorporating incomplete and possibly
erroneous (but not debilitating) a priori knowledge of the system.

Sometimes  a  controller  is  required  which  can  force  a  plant  to  follow  some  desired 
reference  trajectory.  This  model  reference  control  problem  is  approached  here  using  both 
traditional  control  techniques  and  learning  systems.  The  approaches  explored  do  not 
require  that  the  reference  trajectory  satisfy  any  special  constraints,  such  as  being  generated 
by  a  linear  system.  The  only  requirement  is  a  well-defined  method  for  calculating  at  each 
point  in  time  the  desired  rate  of  change  of  the  plant  state. 

Few assumptions are made about the plant itself; it can be nonlinear, poorly
modeled, and subject to unpredictable disturbances. The sensor readings from the plant
must contain sufficient information to observe its state and control it, but may be noisy and
otherwise incomplete. For example, the plant may have actuator dynamics involving
internal state within the actuators that is not measured by any sensor. Specifically, it can
have unknown dynamics that are functions of both state and time. The plant can have
spatial dependencies, that is, nonlinearities that are functions of state and are either static
or quasi-static in time. In addition, it can also have temporal dependencies which are
functions of time, caused by disturbances and other short-term, unpredictable events.

Another important property of the control system is that it be possible to incorporate
any a priori knowledge. This should include knowledge about both the behavior of the
plant in the absence of any control signals, and the effect of the control signals on the plant.
Moreover, it is especially important that errors in the a priori information not cripple the
controller in the long run. The controller should be able to eventually learn these errors and
compensate for them.

Various  algorithms  for  connectionist  learning  systems  are  often  proposed  and 
compared on very small "toy" problems. The error to be minimized is usually defined as
the  total  squared  output  error,  summed  over  the  output  for  each  training  example.  The 
problems arising in learning control often do not resemble these test problems, and so it is
difficult  to  predict  how  various  proposed  modifications  will  affect  learning  controllers.  The 
problems  in  control  typically  involve  learning  functions  that  map  continuous  inputs  to 
continuous  outputs,  and  these  functions  are  generally  smooth  with  possibly  a  few 
discontinuities.  For  a  control  problem,  the  error  is  defined  as  the  total  squared  error, 
integrated over the entire domain. Learning systems that can quickly learn to fit a function
to  a  small  number  of  points  may  not  be  able  to  quickly  learn  the  continuous  functions 
arising  in  typical  control  problems. 

Another  important  aspect  of  learning  control  is  the  order  in  which  training  examples 
become  available.  Most  proposed  learning  systems  are  tested  on  learning  problems 
involving  a  fixed  set  of  training  examples,  which  are  all  available  at  the  same  time,  and 
which  can  be  accessed  in  any  order.  In  control  problems,  the  plant  being  controlled  may 
change its state slowly, or tend to spend large amounts of time in small regions of the
state-space (e.g., near an operating point). This may cause the learning system to receive a
large  number  of  similar  training  examples  before  seeing  different  training  examples.  For 
some  learning  systems  this  uneven  ordering  of  training  data  may  not  matter.  For  others,  it 
may  cause  the  system  to  learn  more  slowly  or  to  forget  important  information.  In  any  case, 
this  is  an  aspect  of  learning  control  that  must  be  taken  into  account  when  comparing  various 
learning systems for use in a controller.

1.3 THESIS OBJECTIVES AND OVERVIEW

The  object  of  this  thesis  is  to  find  methods  for  combining  learning  systems  with 
adaptive  systems  in  order  to  achieve  good  control  in  the  presence  of  both  spatial  and 
temporal  functional  dependencies.  Several  methods  are  developed  for  augmenting  the 
estimation  carried  out  by  an  indirect  adaptive  system  with  the  additional  information 
available  from  a  learning  system.  In  addition  to  developing  this  learning  augmented 
estimation,  various  issues  in  the  construction  and  use  of  connectionist  learning  systems 
are  explored  in  this  context. 

Chapter  2,  Background,  gives  some  of  the  important  concepts  and  historical 
development  of  connectionist  systems,  control  systems,  and  approaches  to  using 
connectionist  systems  for  control. 

Chapter  3,  Hybrid  Control  Architecture,  covers  the  adaptive  controller  and 
connectionist  networks  that  are  integrated  into  a  single  hybrid  controller.  Both  the 
individual components and the final, integrated system are motivated from current
problems, and are described in detail.

Chapter  4,  Connectionist  Learning  for  Control,  covers  some  of  the  difficulties 
associated  with  learning  systems  for  control,  and  describes  the  methods  used  here  to  deal 
with  those  difficulties. 

Chapter  5,  Experiments,  describes  the  various  simulations  performed  and  their 
results. These results are presented graphically and are interpreted in relation to the original
goals. 

Chapter  6,  Conclusions  and  Recommendations,  summarizes  what  has  been 
accomplished, draws conclusions, and points out areas in which future research should be
focused. 

The  bibliography  lists  those  works  which  were  used  in  the  preparation  of  this 
thesis, together with other, related works.

2 BACKGROUND


The  hybrid  learning  /  adaptive  controller  combines  connectionist  learning  systems 
with  traditional  control  systems,  and  modifies  each  of  these  components  to  improve  the 
ability  of  the  hybrid  to  combine  the  strengths  of  each.  Before  describing  the  hybrid  system 
itself,  it  is  first  necessary  to  cover  some  of  the  important  developments  and  concepts 
relating  to  these  components.  Section  2.1  covers  the  development  of  some  of  the  important 
ideas in connectionist learning systems, and Section 2.2 deals with some of the common
approaches  in  traditional  control  theory.  Finally,  Section  2.3  describes  some  of  the 
approaches  that  have  been  taken  in  building  learning  controllers  or  incorporating 
connectionist  learning  systems  into  control  systems. 

2.1  CONNECTIONIST  LEARNING  SYSTEMS 

The  application  of  connectionist  learning  systems  to  problems  in  control  has 
received  considerable  attention  recently.  Such  systems,  usually  in  the  form  of  feedforward 
multilayer networks, are appealing because they are relatively simple in form, can be used
to realize general nonlinear mappings, and can be implemented in parallel computational
hardware. An example of a simple network is shown in Figure 2.1. The network
consists of nodes and connections between nodes. A node may have several real-valued
inputs, each of which has an associated connection weight (also real-valued). Each node
computes  a  nonlinear  function  of  the  weighted  sum  of  its  inputs,  and  then  sends  the  result 
out  along  all  the  connections  leaving  the  node.  Nodes  are  arranged  in  layers,  with  nodes  in 
each  layer  sending  outputs  only  to  nodes  in  subsequent  layers.  In  such  feedforward 
networks,  it  is  easy  to  calculate  network  outputs,  given  a  set  of  inputs. 
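
The forward computation just described can be sketched in a few lines. This is a minimal illustration rather than anything given in the text: the function name forward, the choice of tanh as the node nonlinearity, and the per-node bias term are assumptions.

```python
import math

def forward(layers, inputs):
    """Evaluate a layered feedforward network.

    layers: list of layers; each layer is a list of (weights, bias) pairs, one per node.
    The tanh nonlinearity and the bias term are illustrative assumptions.
    """
    activations = list(inputs)
    for layer in layers:
        # Each node applies a nonlinear function to the weighted sum of its inputs
        # and passes the result to every node in the next layer.
        activations = [
            math.tanh(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in layer
        ]
    return activations
```

Because each layer feeds only later layers, a single pass through the list of layers is enough to compute the outputs.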

Figure 2.1 A connectionist network

A  key  feature  of  feedforward  multilayer  networks  is  that  any  piecewise  smooth 
function  can  be  approximated  to  any  desired  accuracy  by  some  arbitrarily  large  network 
having  the  appropriate  weights  [HW89].  Given  the  correct  weights,  a  network  can  be  used 
to  implement  a  nonlinear  function  that  is  useful  for  a  control  application.  The  difficulty  is 
in  finding  the  appropriate  weights.  No  known  algorithm  guarantees  finding  satisfactory 
weights  for  all  layers  of  a  multilayer  network,  and  Minsky  and  Papert  pointed  out  in  1969 
that the small networks that are guaranteed to converge do not scale well for some
large problems [MP69]. Many saw this as an indication that connectionist approaches were
not  useful  in  general. 

One event that helped change this perception was the development of the error
backpropagation algorithm, independently developed by Werbos [Wer74], Parker [Par82],
LeCun [LeC87], and Rumelhart, Hinton, and Williams [RHW86]. Error back-propagation
is a gradient descent algorithm that modifies network weights incrementally to minimize a
particular measure of error. The error is usually defined as the sum of the squared error in
the output over the set of inputs. The network functions are continuously differentiable, so
it is possible to calculate the gradient of the total error with respect to the weights, and to
adjust the weights in the direction of the negative gradient. As with all gradient descent
optimization techniques, there exists a possibility of converging to a non-optimal local
minimum. Despite this, learning systems using back-propagation have been shown to find
good solutions to various real world problems including difficult, highly nonlinear control
problems. No difficulties due to the presence of local minima were observed in any of the
experiments that are described in this thesis.
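
The gradient descent procedure can be made concrete with a small sketch. It is not the implementation used in this thesis: the single hidden layer of tanh nodes, the linear output node, the per-example weight updates, the learning rate, and the network size are all assumptions made for illustration.

```python
import math
import random

def train(samples, hidden=4, rate=0.05, epochs=500, seed=0):
    """Fit a 1-input, 1-output network by error backpropagation.

    Returns the total squared error before training and after each epoch.
    Architecture, learning rate, and per-example updates are illustrative assumptions.
    """
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]   # input-to-hidden weights
    b1 = [0.0] * hidden                                # hidden biases
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]   # hidden-to-output weights
    b2 = 0.0                                           # output bias

    def predict(x):
        h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
        return sum(w2[j] * h[j] for j in range(hidden)) + b2, h

    def total_error():
        return sum((predict(x)[0] - y) ** 2 for x, y in samples)

    history = [total_error()]
    for _ in range(epochs):
        for x, y in samples:
            out, h = predict(x)
            delta_out = out - y   # gradient of 0.5 * (out - y)**2 with respect to out
            for j in range(hidden):
                # Chain rule back through the output weight and the tanh node.
                delta_h = delta_out * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= rate * delta_out * h[j]
                w1[j] -= rate * delta_h * x
                b1[j] -= rate * delta_h
            b2 -= rate * delta_out
        history.append(total_error())
    return history
```

Each weight is moved a small step along the negative gradient of the squared output error, which is the incremental adjustment described above.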

Backpropagation  and  many  other  connectionist  learning  algorithms  tend  to  converge 
slowly,  and  so  are  more  useful  for  learning  quasi-static  nonlinear  functions  than  for 
adapting  to  rapidly  changing  functions. 

2.1.1  Single-Layer  Networks 

The  earliest  connectionist  systems  were  single-layer  networks.  Single-layer 
networks  are  networks  that  implement  functions  with  the  property  that  the  function  is  a 
linear  combination  of  other  functions,  and  only  the  weighting  factors  in  that  linear 
combination  change  during  learning.  These  networks  tend  to  be  less  powerful,  but  the 
learning rules are simpler, and so these architectures received the earliest attention.

Perceptrons 

One  of  the  early  connectionist  network  models  was  the  simple  perceptron, 
developed  by  Rosenblatt  [Ros62]  in  the  late  50’s  (as  discussed  in  [RZ86][Sim87]). 
Rosenblatt  coined  the  term  perceptron  to  refer  to  connectionist  systems  in  general, 
including  those  with  multiple  layers  and  feedback.  He  is  most  widely  known  for  the 
development  of  the  simple  perceptron.  A  simple  perceptron  is  a  device  which  takes 
several  inputs,  multiplies  each  one  by  an  associated  integer  called  its  weight,  and  finds  the 
sum  of  these  products.  The  simple  perceptron  has  a  single  output,  and  the  inputs  and 
output are each 1 or -1. The output is -1 if the weighted sum of the inputs is negative, and
1  if  the  sum  is  nonnegative. 

If the input is thought of as a pattern and the output as a truth value, then the simple
perceptron can be thought of as a classifier which determines whether or not inputs belong
to a given class. Given a set of input patterns along with their correct classification, it is
sometimes possible to find weights that will cause a simple perceptron to classify these
patterns correctly. Specifically, if such a set of weights exists, then Rosenblatt proved that
a very simple algorithm will always succeed in finding those weights, learning only from
presentations of inputs and their correct classifications. The algorithm simply started with
arbitrary weights, and repeatedly classified training examples. Whenever it got a
classification wrong, each weight that had an effect on the result was incremented or
decremented by one, so as to make the resulting sum closer to the correct answer.
Rosenblatt's "perceptron learning theorem" proving the validity of this algorithm is one of
the more influential results of his research.
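
The learning procedure described above can be sketched as follows. The function names, the limit on the number of passes, and the convention of appending a constant +1 input (so that one weight acts as a threshold) are assumptions for illustration; the text itself does not mention a threshold input.

```python
def classify(weights, inputs):
    # Output is -1 if the weighted sum is negative, and 1 if it is nonnegative.
    total = sum(w * x for w, x in zip(weights, inputs))
    return -1 if total < 0 else 1

def train_perceptron(examples, max_passes=100):
    """examples: list of (inputs, target) with every input and target equal to +1 or -1.

    Returns integer weights that classify all examples, or None if no such
    weights were found within max_passes (an assumed cutoff).
    """
    weights = [0] * len(examples[0][0])
    for _ in range(max_passes):
        mistakes = 0
        for inputs, target in examples:
            if classify(weights, inputs) != target:
                # Inputs are +1 or -1, so every weight moves by exactly one,
                # in the direction that brings the sum closer to the target.
                weights = [w + target * x for w, x in zip(weights, inputs)]
                mistakes += 1
        if mistakes == 0:
            return weights
    return None

# Logical AND with a constant +1 as the last input (hypothetical example data).
AND_EXAMPLES = [((-1, -1, 1), -1), ((-1, 1, 1), -1), ((1, -1, 1), -1), ((1, 1, 1), 1)]
```

For a linearly separable set such as AND_EXAMPLES the loop terminates with a valid set of weights; for a set that is not linearly separable it can only give up.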

It is helpful to think of the inputs to the network as a vector representing a point in
some high-dimensional space. The points at which the weighted sum of the inputs is zero
form a hyperplane in that space, and the output from the simple perceptron will classify
inputs based on which side of the hyperplane they lie on. This means that a single simple
perceptron is only capable of classifying inputs into one of two linearly separable sets,
that is, sets which can be separated by a hyperplane. Although this limits the power of a
single simple perceptron, it is still useful to know that any such classification can be
learned simply by training the simple perceptron with examples of correct classifications.

This limitation on the power of perceptrons can be overcome if the outputs of
several simple perceptrons feed in to another simple perceptron, thus forming a multilayer
perceptron. Rosenblatt was able to show that for any arbitrary desired classification of the
input patterns, there exists a two-layer perceptron which can act as a perfect classifier for
that mapping. Unfortunately, there is no known learning algorithm that is guaranteed to
find the correct weights for a multilayer perceptron as there was in the case of the
single-layer perceptron. Minsky and Papert, in their 1969 book Perceptrons [MP69], analyzed
single-layer perceptrons and pointed out a number of difficulties with them. Simple
perceptrons are only able to recognize linearly separable classes, and so cannot calculate an
exclusive OR, or recognize whether the set of black bits in a picture is connected or not.
The problem remains even if the inputs to the perceptron are arbitrary functions of proper
subsets of the input pattern. Despite the interesting features of single-layer perceptrons,
their conclusion was that "there is no reason to suppose that any of these virtues carry over
to the many-layered version. Nevertheless, we consider it to be an important research
problem to elucidate (or reject) our intuitive judgement that the extension is sterile" [MP69].
Minsky later considered Perceptrons to be overkill, an understandable reaction to excess
hyperbole which was diverting researchers into a false path [RZ86]. However, at the time,
the book was one of the factors contributing to a decrease in interest in connectionist
models in general.
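
As a concrete illustration of how a second layer removes the linear separability restriction, the sketch below computes the exclusive OR with three simple perceptrons. The weights are chosen by hand rather than learned, and the explicit threshold (bias) term on each unit is an assumption that is not part of the description of the simple perceptron given earlier.

```python
def unit(weights, bias, inputs):
    # Simple perceptron with an assumed bias term: -1 if the sum is negative, else 1.
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return -1 if total < 0 else 1

def two_layer_xor(x1, x2):
    """Exclusive OR of two +1/-1 inputs using hand-picked weights."""
    h_or = unit([1, 1], 1, [x1, x2])        # first layer: 1 unless both inputs are -1
    h_nand = unit([-1, -1], 1, [x1, x2])    # first layer: 1 unless both inputs are 1
    return unit([1, 1], -1, [h_or, h_nand]) # second layer: 1 only if both hidden units are 1
```

Each first-layer unit draws one hyperplane, and the second-layer unit selects the region between them, which no single hyperplane can isolate.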


Another  early  system  was  Samuel's  checker  playing  program  [Sam59][Sam67]. 
This  was  the  first  program  capable  of  playing  a  nontrivial  game  well  enough  to  compete 
well  with  humans,  and  it  was  an  important  system  because  it  introduced  a  number  of  new 
ideas.  It  used  both  book  (table)  lookup  and  game-tree  searches,  and  was  the  first  program 
in  which  the  now  common  procedure  of  alpha-beta  pruning  was  used.  It  also  had  a 
learning  component  which  was  not  referred  to  as  a  neural  network  or  connectionist  system 
at the time, but which strongly resembles many such systems.

The program chose its move in checkers by searching a game-tree to some depth
and picking the best move. Alpha-beta pruning and other subtleties were used to make the
search more efficient, but the basic component needed to make it work was a function that
could compare the desirability of reaching each of several possible board positions. Given
an exhaustive search, this scoring function could be as simple as "choose a move that
ensures a win if possible; otherwise avoid a loss." Since Samuel could only search a small
number of moves, the scoring function was very important, and so he built it to combine
the best a priori knowledge he could find with additional knowledge found by the program
through learning.

The a priori knowledge which Samuel started with was a set of heuristic functions
derived  from  a  knowledge  of  what  good  human  players  consider  important.  For  example, 
one  such  function  was  the  number  of  pieces  each  player  had  on  the  board;  another  was  how 
many possible moves the computer had available to choose between. Each of these
functions was hand built to have a good chance of being significant, to be quick and easy
to  calculate,  and  to  return  a  single  number  instead  of  a  vector  or  a  symbol.  The  scoring 
function  was  simply  a  linear  combination  of  each  of  the  outputs  of  these  functions.  Samuel 
referred  to  this  linear  function  as  a  polynomial.  The  learning  system  was  designed  to  pick 
the  functions  that  would  be  included  in  the  linear  combination,  and  to  pick  weights  for 
these  functions. 

All of the weights were initially set to arbitrary values. The program could then
play games against a copy of itself, where only one of the two copies would learn during a
given game. The score for a board position represented the expected outcome of the game.
If the score on the next turn was different, then the later score could be assumed to be more
accurate than the earlier score, since it was based on looking farther ahead in the game.
Therefore the weights would all be modified slightly so that the earlier score would more
nearly match the later score. The polynomial had some fixed terms that were never
changed by learning, which ensured that the score of a board at the end of the game would
always be accurate, preferring wins to losses. The process described here is very similar to
how the perceptron learned, changing weights slightly on each time step so as to decrease
error. There were other important aspects of Samuel's algorithm beyond this, such as
occasionally randomly changing the function to escape local minima, but the core of the
learning process was this simple hill climbing algorithm.
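
The core update, in which the weights are nudged so that the earlier score moves toward the later one, can be sketched as below. This is only an illustration of the idea and not Samuel's actual procedure: the gradient-style rule, the step size, and the function names are assumptions, and the fixed terms and random changes mentioned above are omitted.

```python
def score(weights, features):
    # Linear scoring function: weighted sum of the heuristic feature values.
    return sum(w * f for w, f in zip(weights, features))

def update_toward_later(weights, earlier_features, later_score, rate=0.1):
    """Move the score of the earlier position slightly toward the later score.

    The rule and the step size are illustrative assumptions.
    """
    error = later_score - score(weights, earlier_features)
    # Each weight changes in proportion to its feature's contribution to the score.
    return [w + rate * error * f for w, f in zip(weights, earlier_features)]
```

Repeating this step along a game makes each position's score a better predictor of the scores that follow it.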

Although Samuel said he was avoiding the "Neural-Net Approach" in his program
by including a priori information and learning rules specific to games, the ideas which he
developed are similar in many ways to much later systems for multilayer networks, optimal
control, and reinforcement learning described below. His ideas influenced the work of
Michie and Chambers' Boxes [MC68] and Sutton's Temporal Difference (TD) and Dyna
learning [Sut88][Sut90][BSW89][BS90]. Samuel's algorithm can actually be seen as a
type of incremental dynamic programming [WB90].

ADALINE and MADALINE

A third system which was developed in the late 1950's was Widrow's ADALINE
and MADALINE [Wid89]. He developed a type of adaptive filter which is still in
widespread use today in such items as high-speed modems. It worked by multiplying
several signals by weights, summing them, looking at the output, and then adjusting the
weights according to the errors in the output. His training data was analog and noisy and
came from changing signals, but for the most part his filters were similar to the perceptrons
or polynomial scoring functions described above. When weights were changed in
proportion to their effect on the error, and when the changes became smaller over time,
Widrow proved that the weights were guaranteed to converge. He then went on to add a
squashing function to the output of one of his filters, forcing the output to +1 or -1 on each
time step, and used it for pattern recognition. This "Adaptive Linear Neuron" (ADALINE)
[Wid89] was then built in actual hardware, where weights were represented by the
electrical resistance of copper coated graphite rods, and learning was accomplished by
causing more copper to come out of solution and plate the rods. When the outputs of
multiple ADALINEs were fed into another ADALINE, this formed what Widrow called a
MADALINE (for multiple ADALINEs). By doing this, he was able to get around the
problem of only learning linearly separable functions. However, he did not have a method
for training the weights that connected the first set of ADALINEs to the last one, so he
simply fixed all the weights at a value of one.
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2. 1 .2  Multilayer  Networks 

As  can  be  seen  in  the  above  descriptions,  a  number  of  researchers  were  developing 
very  similar  systems  in  the  late  50's  and  early  60's,  some  of  which  generated  a  great  deal 
of  excitement.  The  particular  difficulties  pointed  out  in  Perceptrons  could  not  be  overcome 
as  long  as  the  output  of  the  device  was  simply  a  function  of  a  linear  combination  of  the 
inputs.  A  second  layer  needed  to  be  added  that  would  take  its  inputs  from  the  outputs  of 
the first layer. Widrow added a second layer in the MADALINE, but was unable to train
all of the weights. The problem of multilayer learning was one of the reasons that interest
in connectionism tended to wane until its resurgence in the late 80's.

Hebbian  Learning 

In 1949, Hebb proposed a simple model of learning based on his studies of
biological neurons. A neuron in this model would generate an output that was some
function of the weighted sum of its inputs. Unlike the models described above, these
weights would learn without any external training signal at all. The learning occurred
according to the Hebbian Learning Rule, which stated that the efficacy of a plastic synapse
increased whenever the synapse was active in conjunction with activity of the postsynaptic
neuron. This meant that the weight of a connection increased whenever both connected
neurons had high outputs at approximately the same time, and decreased when only one of
them did.
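
The rule as stated here can be written down directly. The sketch below is only an illustration under assumed conventions: "high output" is taken to mean exceeding a hypothetical `threshold`, and the fixed step size `rate` is likewise an assumption rather than part of Hebb's model.

```python
def hebbian_update(weight, pre, post, rate=0.1, threshold=0.5):
    """Return the new connection weight given the two neurons' outputs."""
    pre_active = pre > threshold
    post_active = post > threshold
    if pre_active and post_active:
        return weight + rate      # both neurons high at about the same time
    if pre_active != post_active:
        return weight - rate      # only one of the two neurons is high
    return weight                 # neither is high: no change

both_high = hebbian_update(1.0, pre=0.9, post=0.8)
one_high = hebbian_update(1.0, pre=0.9, post=0.1)
neither = hebbian_update(1.0, pre=0.1, post=0.1)
```

Note that no external training signal appears anywhere in the update; only the activity of the two connected neurons is used.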

The basic Hebbian model has been refined in various ways over the years to
improve both its ability to model animal behavior, and its ability to perform useful
functions in systems such as controllers. One important development in this line of
research is Klopf's drive-reinforcement model [Klo88]. In this model, three major
modifications are made to the basic Hebbian model.

First, instead of correlating the output of one neuron with the output of another, the
correlation  is  made  between  changes  in  outputs.  If  signal  levels  are  thought  of  as  drives, 
such  as  hunger,  then  it  does  not  make  sense  for  the  network  to  change  weights  merely  on 
the  basis  of  the  existence  of  these  drives.  However,  when  a  signal  level  changes,  such  as 
would  happen  when  hunger  is  relieved  by  eating,  or  pain  is  increased  due  to  damage  being 
done  to  an  animal,  then  the  network  should  change.  The  second  modification  is  to  correlate 
past inputs (or changes in inputs) with current outputs (or changes in outputs). This
generally allows the network to learn to predict, which a purely Hebbian network is unable
to do. The third modification is to always modify weights in proportion to the current
weight value. This causes learning to follow an "S" shaped curve. At first, a given weight
increases  slowly,  then  grows  more  rapidly,  and  finally  slows  down  again  and  approaches 
an  asymptotic  value.  This  result  is  more  consistent  with  the  result  of  experiments  with 
learning  in  animals. 
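
The three modifications can be combined into a single update, sketched below. This is a deliberately simplified illustration and not Klopf's published equations: it uses only one past time step, and the function name and the `rate` constant are hypothetical.

```python
def drive_reinforcement_delta(weight, past_input_change, output_change, rate=0.5):
    """Weight change that (1) correlates changes rather than signal levels,
    (2) pairs a past input change with the current output change, and
    (3) scales with the magnitude of the current weight."""
    return rate * abs(weight) * past_input_change * output_change

# A steady input level (no change) produces no learning, however large the level.
no_change = drive_reinforcement_delta(0.2, past_input_change=0.0, output_change=1.0)

# The same pair of changes moves a larger weight more than a smaller one.
small_weight = drive_reinforcement_delta(0.2, past_input_change=1.0, output_change=1.0)
large_weight = drive_reinforcement_delta(0.4, past_input_change=1.0, output_change=1.0)
```

Because the step is proportional to the weight itself, a small weight initially grows slowly and then faster, which is the rising part of the "S" shaped curve.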

This model has proven accurate in reproducing a wide range of actual animal
learning experiments. For example, it is possible to simulate Pavlov's results in classical
conditioning. A single neuron can be given one input representing the ringing of a bell,
and another input representing the taste of meat juice. If the output of the neuron is
interpreted as the salivation response of Pavlov's dogs, then the system can be seen to
slowly become classically conditioned, learning to salivate in response to the bell with an
"S" shaped curve. When the meat juice stimulus is removed, it demonstrates extinction of
the  response  in  a  manner  which  is  also  realistic. 

Drive-reinforcement learning has also been applied to control. Multiple drive-reinforcement
neurons have been connected with other components to form controllers for
traditional control problems, as well as for the problem of traversing a maze to the reward
at the end. This is especially interesting in light of the fact that each individual neuron is
not trying to explicitly minimize an error, as in the other controllers discussed here.

Backpropagation

One of the major contributing factors to the return of widespread interest in
connectionist systems is the development of the Error Backpropagation algorithm. The
basic idea is simple. A network consists of a set of inputs, a set of outputs, and a set of
nodes which calculate an output as a function of inputs. The nodes are arranged in layers,
with the inputs connecting to the first layer and the last layer connecting to the outputs. The
network is feedforward, i.e., the complete directed graph of nodes and connections is
acyclic.

Each  node  functions  by  taking  each  of  its  inputs,  multiplying  it  by  an  associated 
weight,  taking  a  smooth,  monotonic  function  of  the  sum  (such  as  the  hyperbolic  tangent), 
and  then  sending  the  result  to  all  of  its  outputs.  If  the  network  is  presented  with  a  set  of 
different  inputs,  it  will  generate  an  output  for  each  one.  The  total  squared  error  in  the 
outputs J can then be calculated, and the weights w changed according to:

$$J = \sum_{i=1}^{n} \left( d_i - f(x_i, w) \right)^2$$

$$\Delta w_j = -\alpha \frac{\partial J}{\partial w_j}$$

where:

$J$ = total error for network with weights $w$
$n$ = number of training examples
$\alpha$ = learning rate (controlling step size)
$w_j$ = jth weight in the network
$x_i$ = input to network for ith training example
$d_i$ = desired output from network for ith training example
$f(x_i, w)$ = actual output from network for ith training example

The change in each weight is proportional to the associated partial derivative. In a
multilayer network, the output of each layer is a simple function of the output of the layer
before it. This allows all of the partial derivatives to be calculated quickly by starting at the
output of the network and working backward according to the chain rule. Propagating
errors backward requires as little computation as propagating the original signals forward.
Furthermore, the error calculations can all be done locally, in the sense that information
need only flow back through the network along the connections that already exist between
nodes. These properties combine to make Backpropagation powerful yet low in both
computation time and hardware required.
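
The backward pass can be illustrated with a very small network. The sketch below makes several assumptions that are not from the text: one input, two hidden tanh nodes, one tanh output node, no bias terms, and hypothetical training data, initial weights, and learning rate. It is meant only to show the chain rule being applied from the output backward along the existing connections.

```python
import math

def forward(x, v, w):
    """Two-layer feedforward network with tanh nodes; returns hidden values and output."""
    hidden = [math.tanh(vk * x) for vk in v]
    out = math.tanh(sum(wk * hk for wk, hk in zip(w, hidden)))
    return hidden, out

def total_error(data, v, w):
    """J = sum over training examples of (desired - actual)^2."""
    return sum((d - forward(x, v, w)[1]) ** 2 for x, d in data)

def gradients(data, v, w):
    """Partial derivatives of J, computed by working backward from the output."""
    grad_v = [0.0] * len(v)
    grad_w = [0.0] * len(w)
    for x, d in data:
        hidden, out = forward(x, v, w)
        # Error at the output node; the derivative of tanh(z) is 1 - tanh(z)^2.
        delta_out = -2.0 * (d - out) * (1.0 - out ** 2)
        for k, hk in enumerate(hidden):
            grad_w[k] += delta_out * hk
            # Error sent back along the existing connection w[k].
            delta_hidden = delta_out * w[k] * (1.0 - hk ** 2)
            grad_v[k] += delta_hidden * x
    return grad_v, grad_w

def train_step(data, v, w, alpha=0.05):
    """One gradient descent step: each weight moves by -alpha times its partial derivative."""
    grad_v, grad_w = gradients(data, v, w)
    v = [vk - alpha * g for vk, g in zip(v, grad_v)]
    w = [wk - alpha * g for wk, g in zip(w, grad_w)]
    return v, w

data = [(-1.0, -0.5), (0.5, 0.3), (1.0, 0.5)]   # hypothetical (input, desired output) pairs
v, w = [0.5, -0.3], [0.4, 0.2]                  # hypothetical initial weights
error_before = total_error(data, v, w)
for _ in range(200):
    v, w = train_step(data, v, w)
error_after = total_error(data, v, w)
```

Each hidden node's error term is obtained from the output node's error term with one multiplication per connection, which is why the backward pass costs about as much as the forward pass.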

This  gradient  descent  process  is  simple  and  works  well  for  multilayer  networks,  but 
is  not  guaranteed  to  find  the  best  weights  possible.  As  with  all  hill-climbing  methods,  it 
can  get  stuck  in  a  local  minimum.  Although  this  method  cannot  be  guaranteed  to  find  the 
correct  answer  (as  simple  perceptrons  were),  it  is  still  a  useful  method  which  has  been 
shown  to  work  well  on  a  variety  of  problems.  Unfortunately,  pure  gradient  descent 
methods  often  converge  slowly  in  the  presence  of  "troughs"  in  the  error  surface.  If  the 
error  as  a  function  of  network  weights  is  thought  of  as  a  high-dimensional  surface,  then  a 
long, thin trough in this surface slows convergence. If the current set of weights is a point
on the side of a trough, then the gradient will point mainly down the side of the trough, and
only slightly in the direction along the trough toward the local minimum. If the weights
change in large steps, they will oscillate across the trough. If they change in small steps, then
they converge to the local minimum very slowly.
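
The trough behavior can be reproduced on a two-weight example. The quadratic error surface, starting point, and step sizes below are assumptions chosen for illustration; the surface is 100 times more curved across the trough (w2) than along it (w1).

```python
def descend(alpha, steps, w=(10.0, 1.0)):
    """Gradient descent on J = 0.5*(w1^2 + 100*w2^2), a long thin trough along w1."""
    w1, w2 = w
    path = [(w1, w2)]
    for _ in range(steps):
        # The gradient of J is (w1, 100*w2).
        w1, w2 = w1 - alpha * w1, w2 - alpha * 100.0 * w2
        path.append((w1, w2))
    return path

large = descend(alpha=0.019, steps=5)   # large steps: w2 flips sign on every step
small = descend(alpha=0.001, steps=5)   # small steps: w1 barely moves toward the minimum
```

With the large step the weights bounce from one side of the trough to the other, while with the small step the progress along the trough floor is very slow.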

There are a number of approaches to speeding up convergence in this case. One is
to look at the second derivative in addition to the gradient at each point. If the error is a
scalar function of multiple weights, then its second derivative is a matrix giving the second
partial derivative of the error with respect to each possible pair of weights. This matrix,
called the Hessian, has a useful geometric interpretation. Multiplying a vector by this
matrix stretches the vector in some directions and compresses it in others. For the direction
in which the error surface has least curvature, the Hessian will compress vectors. For the
direction in which the error surface has greatest curvature, the Hessian will stretch vectors.
Multiplying a vector by the inverse of the Hessian has the opposite effect. Multiplying the
gradient by the inverse of the Hessian will cause the weights to change more in the
direction along a trough (where the curvature is small), and less across the trough (where
the curvature is large). If the network has N weights, this requires inverting an N by N
matrix on every iteration during training. This overwhelming flood of calculations may
defeat the purpose by requiring more computation than is saved by the shorter path to
convergence. This is why a number of approximate approaches have been proposed for
solving this problem, such as using only the diagonal of this matrix, or using heuristics that
approximate the effect of the inverse Hessian.
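
The effect of multiplying the gradient by the inverse Hessian can be seen on a two-weight quadratic error surface. The surface, the starting weights, and the function name `newton_step` are assumptions for illustration; the 2 by 2 inverse is written in closed form to keep the sketch self-contained.

```python
def newton_step(w, grad, hessian):
    """One step w - H^{-1} g for a 2x2 Hessian, inverted in closed form."""
    (a, b), (c, d) = hessian
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    step = (inv[0][0] * grad[0] + inv[0][1] * grad[1],
            inv[1][0] * grad[0] + inv[1][1] * grad[1])
    return (w[0] - step[0], w[1] - step[1])

w = (10.0, 1.0)
grad = (w[0], 100.0 * w[1])             # gradient of J = 0.5*(w1^2 + 100*w2^2)
hessian = ((1.0, 0.0), (0.0, 100.0))    # small curvature along w1, large across w2
w_new = newton_step(w, grad, hessian)
```

On an exactly quadratic surface this single step lands on the minimum, because the large gradient component across the trough is scaled down and the small component along it is left at full size. For a real network the surface is not quadratic and the matrix is N by N, which is where the cost discussed above arises.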

2.2  TRADITIONAL  CONTROL 

Control theory deals with the problem of forcing some system, called the plant, to
behave in a desired manner. The set of relevant properties of the plant which change through
time is called the state, and is represented by the real vector x. For example, in the cruise
control for a car, the state might include the current speed and slope of the ground. If the
state cannot be measured directly, then the sensor readings are represented by another real
vector y. The control action is the set of signals applied to the plant by the controller, and
is represented by the real vector u. The plant state then is assumed to evolve in time
according to:

$$\dot{x}_t = f(x_t, u_t)$$

$$y_t = g(x_t)$$

The majority of control theory is devoted to the special case where the plant is
linear, in which case the state evolves according to

$$\dot{x}_t = A x_t + B u_t$$

$$y_t = C x_t$$
where A, B, and C are constant matrices. Even if a plant is not truly linear, it is often
close enough to linear within certain regions of the state-space that a controller can be
designed for that region based on a linear approximation of the plant. This is useful since
the theory for linear plants is better developed than for nonlinear plants [D'A88].

Once the plant has been modeled, the controller must be designed to accomplish
some purpose. If the goal is to keep the state at a certain value, then the controller is called
a regulator. If the goal is to force the plant to follow a given trajectory with a specified
transient response, then the controller is a model reference tracking controller. If the goal
is to minimize some cost function of the whole trajectory, then it is an optimal control
problem.

Traditional  control  techniques  are  based  on  approaches  such  as  bang-bang, 
proportional, proportional-integral-derivative (PID), gain scheduling, and adaptive control,
each  described  in  a  section  below.  These  are  important  control  approaches  with  which 
connectionist  control  techniques  should  be  compared.  In  addition  to  this,  most  of  them  can 
be  included,  directly  or  indirectly,  in  the  hybrid  system  developed  in  this  thesis. 

Several  of  the  systems  described  here  were  first  demonstrated  on  a  standard  cart- 
pole  system  problem.  This  plant  is  illustrated  in  figure  2.2. 


In this problem, the cart is confined to a one dimensional track, and force can be
applied to it in either direction to cause it to move left or right. On top of the cart is a pole,
which is hinged at the bottom and can swing freely. No forces are applied to the pole
directly, so it is only influenced indirectly through forces applied to the cart. The problem
of balancing the pole is similar to the problem of balancing a broomstick on a person's
hand. This is a standard control problem and is useful for demonstrating new control
methods. A version of this problem is used here to test the new hybrid control systems
developed in this thesis.

2.2.1 Bang-Bang Control

The simplest form of control is a controller that only has two possible outputs.
This "bang-bang" control is commonly used in thermostats which alternate between
running the heating system full on and turning it off completely. This type of control has
also been used in a learning system to balance a pole on a cart while keeping the cart within
a certain region [BSA83]. Unfortunately, bang-bang control systems are generally
incapable of exercising very fine control, and so usually lead to limit cycles in the plant
being controlled, i.e., the state repeatedly follows a certain periodic path instead of settling
down to a single state. A pole can actually be balanced on a cart by always applying a
certain force in the same direction the pole is leaning. Naturally, this leads to a limit cycle
with the pole swinging back and forth between two extremes. For finer control, a more
general controller is required, such as a proportional controller.
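
The thermostat case makes the limit cycle easy to see. The room model below (heat loss proportional to the difference from an outside temperature) and all of its constants are assumptions made for illustration only.

```python
def simulate_thermostat(setpoint=20.0, temp=15.0, steps=200, dt=0.1):
    """Bang-bang control: the heater is either fully on or fully off."""
    outside, heater_power, loss_rate = 5.0, 30.0, 1.0   # assumed plant constants
    history = []
    for _ in range(steps):
        heater_on = temp < setpoint
        heating = heater_power if heater_on else 0.0
        temp += dt * (heating - loss_rate * (temp - outside))
        history.append(temp)
    return history

temps = simulate_thermostat()
tail = temps[-50:]   # behavior after the initial warm-up
```

The temperature never settles at the setpoint. It keeps crossing it from below and above, repeating the same periodic path.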

2.2.2  Proportional  Control 

A  proportional  controller  is  perhaps  the  simplest  controller  imaginable  that  still  has 
continuously  varying  control  actions.  Each  input  to  the  controller  is  a  real  value, 
representing one element of the output of the plant being controlled. In a regulator, that is
the only input, and the controller tries to control the plant so that all of the elements of the
state vector are nominally zero. In a general controller, each element of the desired state
vector  is  also  an  input.  The  controller  then  multiplies  each  input  by  a  constant  gain, 
possibly  adds  a  constant,  and  uses  the  result  as  the  control  signal.  If  the  control  action  is  a 
vector  involving  several  signals,  then  the  same  process  is  followed  for  each  of  them,  using 
a  different  set  of  gains  each  time. 
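
A proportional controller of this kind amounts to one weighted sum per control signal. In the sketch below the function name, the gains, and the offsets are hypothetical values chosen only to show the structure.

```python
def proportional_control(inputs, gain_rows, offsets):
    """Each control signal is a constant plus a gain-weighted sum of the inputs."""
    return [offset + sum(k * x for k, x in zip(row, inputs))
            for row, offset in zip(gain_rows, offsets)]

inputs = [1.0, 2.0]                        # e.g. two measured elements of the plant output
gain_rows = [[-2.0, -0.5], [1.0, 0.0]]     # one set of gains per control signal
offsets = [0.0, 0.5]
controls = proportional_control(inputs, gain_rows, offsets)
```

For a general (non-regulator) controller the elements of the desired state would simply be appended to `inputs`, with their own gains in each row.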

To design a satisfactory proportional controller, it is first necessary to have a good
model of the system being controlled. If the plant is linear and perfectly modeled, or even
if the plant is only close to linear, then it is often possible for a proportional controller to do
an acceptable job of controlling it.

2.2.3  PID  Control 

If  the  control  signal  to  a  plant  is  simply  proportional  to  the  error  in  its  state,  then  as 
the  state  approaches  the  desired  state,  the  correcting  force  will  decrease  proportionally. 
Often, there will be some point near the desired point at which the small correcting force is
balanced by other forces, and the plant will settle into a steady state which has a slight
error. In order to overcome this steady state error, the controller might integrate the error
over a long period of time, and add a component to the control signal proportional to this
integral. It may also be possible to improve the control signal by taking into account not
only the output error, but also how the output error is changing in time. For this reason it
may  be  useful  to  add  a  term  to  the  control  signal  proportional  to  the  derivative  of  the  output 
error. 

If both of these modifications are made to a proportional controller, it is then called
a proportional plus integral plus derivative (PID) controller. If the input to this controller
and the output from it are considered as functions of time and the Laplace transform of
them is taken, then the relationship between input and output is simple. The ratio of output
to input is some quadratic function of s divided by s. In discrete-time control, this means
that the output of the controller is a linear combination of four things: the control action on
the previous time step, the current output error, the output error on the previous time step,
and the output error on the time step before last. Since the control output is at least partially
proportional to the control applied on the previous time step, a small error in plant output
causes the controller output to keep increasing until the error is gone. This is the integral
portion of the controller. Since plant output errors from three different time steps are used,
it is possible to subtract them and estimate how fast the output error is changing. This is the
derivative aspect of the controller. Also, since the current output error affects the controller
output directly, it has a proportional control component. Therefore all three types of
control are present, and the controller is referred to as a PID controller.
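
The discrete-time form described above, in which the new output is the previous output plus a weighted sum of the last three errors, can be sketched as follows. The first-order plant dx/dt = -x + u, the gains, the time step, and the function names are all assumptions for illustration. The coefficients q0, q1, q2 are the usual ones obtained by differencing the proportional, integral, and derivative terms.

```python
def simulate(controller, steps=500, dt=0.1, target=1.0):
    """Drive the assumed first-order plant dx/dt = -x + u and return the final state."""
    x = 0.0
    for _ in range(steps):
        u = controller(target - x)
        x += dt * (-x + u)
    return x

def make_p(kp):
    """Proportional-only controller."""
    return lambda error: kp * error

def make_pid(kp, ki, kd, dt=0.1):
    """Incremental PID: new output = previous output + weighted sum of last three errors."""
    q0 = kp + ki * dt + kd / dt
    q1 = -kp - 2.0 * kd / dt
    q2 = kd / dt
    memory = {"u": 0.0, "e1": 0.0, "e2": 0.0}
    def controller(error):
        u = memory["u"] + q0 * error + q1 * memory["e1"] + q2 * memory["e2"]
        # Shift the stored history: the current error becomes the previous error.
        memory["u"], memory["e2"], memory["e1"] = u, memory["e1"], error
        return u
    return controller

x_p = simulate(make_p(2.0))               # settles short of the target
x_pid = simulate(make_pid(2.0, 1.0, 0.1)) # integral action removes the offset
```

With proportional control alone, this plant settles where the correcting signal balances the plant's own decay, leaving a steady state error of one third of the target. The PID version keeps increasing its output until the error is gone.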

PID  control  is  widely  used;  in  fact  perhaps  90%  of  all  of  the  controllers  in  existence 
are  PID  controllers  (or  PI  or  P,  which  are  just  PID  with  some  gains  set  to  zero)  [Pal83].  If 
a  plant  is  linear,  it  is  often  possible  to  design  PID  controllers  that  give  the  desired 
performance.  If  a  plant  is  nonlinear,  but  will  usually  stay  in  some  small  region  of  its  state- 
space,  then  it  is  often  practical  to  approximate  the  plant  with  a  linear  model  in  that  region 
and  design  a  PID  controller  for  that  model.  This  model  can  be  derived  from  the  full, 
nonlinear  equations  describing  the  plant,  by  taking  the  derivative  of  those  equations,  and 
evaluating  it  at  a  given  point  in  the  middle  of  the  region  of  interest. 
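
The linearization step can also be carried out numerically. The sketch below assumes a hypothetical pendulum plant with a torque input and estimates the derivatives by central differences at the operating point; the names `pendulum` and `linearize` and the constant `g_over_l` are illustrative, not from the text.

```python
import math

def pendulum(state, u, g_over_l=9.81):
    """Assumed nonlinear plant: dx/dt = f(x, u) for a pendulum with torque input u."""
    theta, omega = state
    return [omega, -g_over_l * math.sin(theta) + u]

def linearize(f, x0, u0, eps=1e-6):
    """Central-difference estimates of A = df/dx and B = df/du at the point (x0, u0)."""
    n = len(x0)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        hi, lo = list(x0), list(x0)
        hi[j] += eps
        lo[j] -= eps
        f_hi, f_lo = f(hi, u0), f(lo, u0)
        for i in range(n):
            A[i][j] = (f_hi[i] - f_lo[i]) / (2 * eps)
    f_hi, f_lo = f(x0, u0 + eps), f(x0, u0 - eps)
    B = [(f_hi[i] - f_lo[i]) / (2 * eps) for i in range(n)]
    return A, B

# Linear model valid near the hanging-down rest position.
A, B = linearize(pendulum, [0.0, 0.0], 0.0)
```

The resulting A and B define a linear model that approximates the plant only near the chosen point, which is why the plant must stay in a small region of its state-space for the design to remain valid.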

2.2.4  Adaptive  Control 

Instead  of  creating  a  fixed  controller  based  on  a  priori  knowledge  of  a  plant,  it  is 
sometimes  beneficial  to  build  a  controller  that  can  change  if  the  plant  is  different  than  the 
model,  or  if  the  plant  changes  or  experiences  disturbances.  Starting  in  the  early  1950's, 
researchers  enthusiastically  pursued  adaptive  control,  especially  for  aircraft,  but  without 
much underlying theory. Interest then diminished in the early 1960's due to a lack of
theory  and  a  disaster  during  a  flight  test  [Ast83].  More  recently,  adaptive  control  is  finally 
beginning  to  reemerge  as  a  more  widely  used  approach. 

Adaptive control techniques can be categorized as either indirect adaptive control or
direct adaptive control. Indirect adaptive control utilizes an explicit model of the plant,
which is updated periodically, to synthesize new control laws. This approach has the
important advantage that powerful design methods (including optimal control techniques)
may be used on-line; however, it has the key disadvantage that on-line model identification
is required. Alternatively, direct adaptive control does not rely upon an explicit plant
model, and thus avoids the need to perform model identification. Instead, the control law
is adjusted directly, based on the observed behavior of the plant. In either case, the
controller will adapt if the plant dynamics change by a significant degree.
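
The indirect scheme can be sketched for a scalar discrete-time plant. Everything specific here is an assumption for illustration: the plant x[k+1] = a*x[k] + b*u[k], the initial model guesses, the normalized gradient identification rule, and the one-step control law computed from the current model. It is not a design taken from the adaptive control literature cited above.

```python
def run_indirect_adaptive(a=1.2, b=0.5, target=1.0, steps=100):
    """Scalar plant x[k+1] = a*x[k] + b*u[k]; a and b are unknown to the controller."""
    a_hat, b_hat = 0.5, 1.0          # initial (wrong) model of the plant
    x = 0.0
    for _ in range(steps):
        # Control law synthesized from the current model estimate.
        u = (target - a_hat * x) / b_hat
        x_next = a * x + b * u       # response of the true plant
        # On-line model identification: normalized gradient update of the model.
        prediction_error = x_next - (a_hat * x + b_hat * u)
        norm = 1e-9 + x * x + u * u
        a_hat += x * prediction_error / norm
        b_hat += u * prediction_error / norm
        x = x_next
    return x, a_hat, b_hat

final_x, a_hat, b_hat = run_indirect_adaptive()
```

The state is driven to the target even though the controller starts with a wrong model. Note that the estimates need not converge to the true a and b; they only need to become consistent with the data the plant actually produces. A direct adaptive controller would instead adjust the gains of the control law itself, with no `a_hat` or `b_hat`.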

Adaptive  controllers  are  usually  designed  with  the  assumption  that  there  is  some 
modeling  error,  but  that  the  plant  behaves  locally  in  a  linear  manner.  The  structure  of  the 
controller  itself  is  often  limited  to  being  linear  at  any  point  in  time,  but  the  "constants"  in  the 
controller  can  change  slowly  over  time  as  it  adapts.  Even  with  all  of  these  assumptions  of 
linearity, the entire system consisting of both an adaptive controller and a plant is nonlinear
while the parameters are adapting. This has made it very difficult to prove that these
controllers are stable, although recent progress has been made in this area [Ast83].

Adaptive  control  systems  generally  exhibit  some  delay  while  they  are  adjusting, 
particularly  when  noisy  sensors  are  used  (since  filtering  creates  additional  delay).  If  the 
characteristics of the plant vary considerably over its operating envelope (e.g., due to
nonlinearity), an adaptive controller based on a linearized model of the plant can end up
spending a large percentage of its time in a "partially" adapted state, leading to degraded
performance. Moreover, the control system will have to readapt every time a new regime of
the operating envelope is entered.

2.2.5  Gain  Scheduled  Control 

Although a system with significant nonlinearities could be controlled by an adaptive
controller which adjusts to the local linear dynamics in each region of the operating
envelope, most control systems handle nonlinearities with gain scheduling. Such
controllers are collections of simple proportional controllers, one for each distinct region of
the operating envelope. For example, in a typical complex control system, the state vector
might include 10 elements, three of which are special. When these three are kept constant,
a simple, linear control law can work well. The commands sent to the actuators can be the
product of a gain matrix and the state vector. When any of the three special elements
change, though, a new linear control law with new gains must be used. In a gain scheduled
controller, the space of all possible values for those three state vector elements is divided up
into distinct regions. Each region employs a different set of gains, and a scheduler is used
to smoothly transition whenever the state moves from one region to another. The
drawback to this approach is that it usually requires a good model of the plant, as well as
large amounts of heuristic manual tuning to decide where the boundaries between regions
should be, and what the corresponding control law in each region is. Once the controller is
built, it cannot accommodate a slowly changing plant, such as a robot where
bearings wear and parts age. This control technique does respond instantly, though, when
it enters a new region, while the adaptive controller would have to wait for more
information before it could adapt to the new region. For these and other reasons (e.g.,
capability), gain scheduled control is generally used instead of adaptive control in most
complex systems today.
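
A scheduler of this kind can be sketched with a single scheduling variable. The breakpoints, the gains, and the use of linear interpolation for the smooth transition are all assumptions for illustration; a real schedule would typically span several variables and use gain matrices rather than scalars.

```python
import bisect

def scheduled_gain(schedule_value, breakpoints, gains):
    """Interpolate linearly between the gains designed at each breakpoint."""
    if schedule_value <= breakpoints[0]:
        return gains[0]
    if schedule_value >= breakpoints[-1]:
        return gains[-1]
    hi = bisect.bisect_right(breakpoints, schedule_value)
    lo = hi - 1
    frac = (schedule_value - breakpoints[lo]) / (breakpoints[hi] - breakpoints[lo])
    return gains[lo] + frac * (gains[hi] - gains[lo])

def control(state_error, schedule_value):
    """Linear control law whose gain depends on the current operating region."""
    breakpoints = [0.0, 100.0, 200.0]   # hypothetical operating points
    gains = [4.0, 2.0, 1.0]             # hypothetical gains tuned at each point
    return scheduled_gain(schedule_value, breakpoints, gains) * state_error

u_mid = control(0.5, 150.0)
```

The lookup responds immediately when the operating point changes, but the table itself is fixed once it has been built, which is the limitation noted above.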

2.3 CONNECTIONIST CONTROL APPROACHES

A number of approaches have been suggested for using learning systems
in control lbu86][Jd ii-89]. The systems generally try to solve one of three control
problems: produce specified control signals, follow specified trajectories, or optimize
specified reinforcement signals. For each of these problems there are one or more
different approaches which have been tried, the most common of which are described
below.

2.3.1 Producing Specified Control Signals


Figure 2.3  Learning specified control signals

The  simplest  use  of  a  connectionist  network  in  a  control  application  is  to  emulate  an 
existing  controller.  This  is  shown  in  figure  2.3.  The  controller  and  the  network  are  both 
told the current state and the commanded state (the state to which the controller should drive
the plant). The controller then calculates an appropriate control signal by some means, and
the network also calculates a control signal. If they differ, the difference is the error in the
network's output and is used to train the network (shown by the diagonal line through the
network). In the figure shown, the network has no effect on the behavior of the system; it
is simply a passive observer. Once the network has learned, the weights in the network
would be frozen, and the original controller would be completely removed from the
diagram and replaced by the network. One early network, Widrow's ADALINE in the
1960's, was trained to balance a pole on a cart by watching a human do it, and learning
from that example [Wid89]. Almost any general supervised learning or automatic function
approximation system can be used to control a system in this manner, although the
technique is obviously limited to systems where a control system already exists. This
approach  might  actually  be  useful  in  situations  where  it  would  be  too  expensive  or  too 
dangerous  to  have  a  human  controlling  a  system,  but  where  a  network  could  be  used  fairly 
cheaply.  It  would  also  be  useful  if  it  could  be  trained  by  example  to  control  under  certain 
conditions, and then could generalize to others. These are unlikely to be very common
uses  of  such  a  system. 

A more widely applicable use may be as a component of a larger control system that
learns to reproduce the results of the other components. For example, a control algorithm
may  require  an  extensive  tree  search  on  each  time  step  that  takes  too  long  to  implement  in 
real-time,  even  in  hardware.  If  it  is  possible  to  train  a  network  to  implement  the  same 
mapping  from  state  to  outputs,  then  the  network  could  replace  the  slow  controller. 
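
The emulation scheme of this section can be sketched with the simplest possible network, a single linear unit trained on the difference between its output and the existing controller's output. The controller being copied, the input ranges, the learning rate, and the function names are hypothetical choices for illustration.

```python
import random

def existing_controller(state, command):
    """Hypothetical controller to be emulated: a fixed proportional law."""
    return 2.0 * (command - state)

def train_emulator(samples=2000, rate=0.05, seed=0):
    """Train a linear network to reproduce the controller's output from examples."""
    rng = random.Random(seed)
    weights = [0.0, 0.0]                      # weights on (state, command)
    for _ in range(samples):
        inputs = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        target = existing_controller(*inputs)
        output = sum(w * x for w, x in zip(weights, inputs))
        error = target - output               # difference between the two control signals
        weights = [w + rate * error * x for w, x in zip(weights, inputs)]
    return weights

weights = train_emulator()
```

During training the network only observes. Once the weights have converged they would be frozen and the network used in place of the original controller. A slow controller with a nonlinear mapping would need a multilayer network in place of the single linear unit used here.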

2.3.2  Following  Specified  Trajectories 

A much more common control problem is that of following specified trajectories. If
the plant being controlled is fairly well understood, and if it is not very nonlinear, then it is
often possible to specify a trajectory for the plant which is known to be both useful and
achievable. For example, if a robot arm is told to move from its current position to a new
position, the ideal behavior might be for it to instantaneously move to that position, and
completely stop moving as soon as it reaches it. This, unfortunately, requires the
application of infinite force to the arm. On the other hand, it requires very little force to
move the arm to the new position quickly but with a large amount of overshoot and
oscillation once it gets there, or to move it to the position slowly but with little overshoot.
There is a trade-off between force applied, time to get to the correct position, and time to
settle once it is there. The exact nature of the trade-off depends on the particular equations
governing the arm. Often, through partial models of the plant, trial and error, and
experience with similar plants, it is possible for a control engineer to choose a particular
trajectory for the arm that is achievable and that gives acceptable behavior for the particular
application.

Choosing the reference trajectory may or may not be difficult in a given situation,
but it is extremely important. If the reference trajectory is not very demanding (causing the
state to approach the desired state very slowly or allowing a large amount of overshoot),
then the system will not perform as well as it could with a better controller. If the reference
trajectory  is  too  demanding  (causing  the  state  to  approach  the  desired  state  rapidly  with  little 
overshoot),  then  the  controller  will  attempt  to  use  control  actions  outside  the  range  of  what 
is  possible,  and  the  system  may  become  unstable. 

Once such a reference trajectory has been found, the controller must simply act
at each point in time so as to move the plant along that reference trajectory. Three
approaches for using connectionist learning systems in "model reference" control problems
have been explored: learning a plant inverse, dynamic signs, and Backpropagation through
a learned model.

Learning  a  Plant  Inverse 




A  conceptually  simple  approach  to  model  reference  control  is  to  use  a  learning 
network  to  learn  a  plant  inverse.  In  a  deterministic  plant,  the  state  of  the  plant  on  a  given 
time  step  is  a  function  of  both  the  state  and  control  action  on  the  previous  time  step. 
Alternatively,  in  continuous  time,  the  rate  of  change  of  state  at  a  given  point  in  time  is  a 
function  of  the  state  and  control  action  at  that  point  in  time.  An  inverse  of  this  function 
with  respect  to  the  control  signal  is  a  useful  function  to  know.  Given  the  current  state  and 
the desired next state (or desired rate of change of state), an inverse gives the control action
required. If a network can learn such an inverse, then it can calculate the control actions on
each  time  step  that  will  cause  the  plant  to  follow  a  desired  trajectory. 

Figure  2.4  illustrates  how  a  network  is  trained  to  learn  the  plant  inverse.  First, 
some  kind  of  exploring  controller  is  used  to  drive  the  plant.  This  may  not  be  a  very  good 
controller;  in  fact  it  could  even  behave  randomly.  Its  purpose  is  simply  to  exercise  the  plant 
and show examples of various actions being performed in various states. The network
then takes two inputs: the plant's state at the current time and the plant's state on the
previous time step. The output of the network is then its estimate of the control action that
caused the plant to make the transition from one state to the other. This estimate is then
compared to the actual command to generate the error signal used to train the network.
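The training procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: the scalar linear plant, the two-weight linear "network", the learning rate, and all function names are hypothetical and are not taken from the report. A random exploring controller drives the plant, and the inverse model is trained on the difference between its estimated action and the action actually applied.

```python
import random

def plant(x, u):
    """Hypothetical scalar plant (an assumption): next state from state and control."""
    return 0.9 * x + 0.5 * u

def train_inverse(steps=20000, lr=0.05, seed=0):
    """Fit u ~ w[0]*x_prev + w[1]*x_next from random exploration (LMS rule)."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    x = 0.0
    for _ in range(steps):
        u = rng.uniform(-1.0, 1.0)        # exploring controller: random actions
        x_next = plant(x, u)
        u_hat = w[0] * x + w[1] * x_next  # estimate of the action that was taken
        err = u - u_hat                   # compare with the actual command
        w[0] += lr * err * x
        w[1] += lr * err * x_next
        x = x_next
    return w

def inverse_control(w, x, x_desired):
    """Use the learned inverse: action intended to move x to x_desired."""
    return w[0] * x + w[1] * x_desired

w = train_inverse()
u = inverse_control(w, 0.2, 0.5)
reached = plant(0.2, u)
```

For this particular plant the exact inverse is u = 2*x_next - 1.8*x, so the learned weights can be checked against it. A nonlinear plant would need a nonlinear approximator in place of the two weights.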

Figure 2.5 shows how the network is used after it has learned. Given the current
state of the plant and the desired next state (as specified by the reference trajectory), the
network generates a control action to move the plant to that new state. If the next state of
the plant does not match the desired state perfectly, then this error could be used to continue
training the network. In this way, the network could learn to control a plant whose
dynamics gradually change over a long period of time.

A  fundamental  problem  with  learning  the  inverse  of  the  plant  is  the  network’s 
behavior  when  the  plant  does  not  have  a  unique  inverse.  Most  network  architectures,  when 
trained  to  give  two  different  outputs  for  the  same  input,  will  respond  by  learning  to  give  an 
output  that  is  the  average  of  the  training  values.  For  example,  if  a  plant  at  a  particular  state 
can  be  forced  to  act  in  the  desired  way  by  giving  a  control  signal  of  either  1  or  3,  the 
network  will  usually  learn  an  output  of  2  for  that  state,  which  may  be  a  far  worse  action 
than  either  1  or  3. 
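The averaging effect can be reproduced with a toy case. The plant below is a hypothetical one, chosen only so that control signals of 1 and 3 have exactly the same effect. A single output trained by gradient descent on squared error settles near 2, which produces a different transition from the one requested.

```python
def plant(x, u):
    """Hypothetical plant with a non-unique inverse: u=1 and u=3 act alike."""
    return x + (u - 2.0) ** 2

# Training targets for one fixed (state, desired next state) input.
# Both actions achieve the desired transition, so both appear as targets.
examples = [1.0, 3.0] * 50

# One trainable output, updated by gradient descent on squared error.
output = 0.0
for _ in range(200):
    for target in examples:
        output += 0.01 * (target - output)

x = 0.0
desired = plant(x, 1.0)       # same next state as plant(x, 3.0)
achieved = plant(x, output)   # next state produced by the averaged action
```

Here the desired next state is 1.0, while the averaged action leaves the state almost unchanged.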

If  the  plant  is  a  stochastic  system,  then  the  result  of  a  single  action  will  be  an  entire 
probability  distribution  function,  which  further  complicates  the  problem  of  learning  either 
the  forward  or  inverse  model,  and  of  choosing  the  best  action.  These  problems  often  limit 
the  usefulness  of  learning  plant  inverses. 

Dynamic Signs

A learning system using dynamic signs is shown in Figure 2.6. For a given state,
the network tries to find an action that will drive the plant to the next state on the trajectory
defined by the reference. If it does this, then the new state will equal the output of the
reference, and the subtraction will yield zero error, so no learning will occur. On the other
hand, if there is an error in the state, then each weight in the network should be adjusted
proportionally to its effect on that error. Finding the effect of a given weight on the control
signal is easy; it is simply the partial derivative of the control with respect to that weight.
To find its effect on the plant's state, however, it is necessary to know the partial derivative
of the state of the plant with respect to the control signal.

Often the general behavior of a plant is known, even though all the exact equations
and constants are not known. For example, it is often clear that applying more control
action will cause one element of the state to increase and another one to decrease, even
though it is not possible to predict exactly how much change will occur. In this case, the
partial derivative of state with respect to control is not known, but the sign of the partial
derivative is known. If the actual partial derivatives were known, then the error in state
would be multiplied by the derivative before being used to train the network. Since only
the sign of the derivative is assumed known, each element of the error is merely multiplied
by plus or minus one. Figure 2.6 shows how the error in the state is multiplied by this
value before being used to train the network. This "dynamic sign" has been shown in
some cases to contain enough information to cause the network to converge on a reasonable
controller [FGG90]. It has been shown [OF90][EF90] that for autonomous submarine
control with a multidimensional state vector and a scalar control, the system can learn to be
an effective controller using dynamic signs.
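A minimal sketch of the dynamic-sign update follows. The scalar plant, its gain, the reference model, the one-weight controller, and the learning rate are all assumptions made for illustration. The learner never uses the plant gain itself, only the assumed sign of the partial derivative of state with respect to control.

```python
import random

PLANT_GAIN = 0.3      # hypothetical; treated as unknown to the learner
DYNAMIC_SIGN = +1.0   # assumed a priori knowledge: more control raises the state

def plant(x, u):
    """Hypothetical scalar plant."""
    return x + PLANT_GAIN * u

def reference(x):
    """Hypothetical reference model: desired next state for the current state."""
    return 0.5 * x

rng = random.Random(0)
w = 0.0                               # controller: u = w * x
for _ in range(5000):
    x = rng.uniform(-1.0, 1.0)
    u = w * x
    error = reference(x) - plant(x, u)
    # The state error is multiplied by the sign of d(state)/d(control),
    # then by d(control)/d(weight), which is x for this controller.
    w += 0.5 * DYNAMIC_SIGN * error * x
```

With these numbers the weight that matches the reference exactly is -0.5/0.3. Using the wrong sign would drive the weight away from that value instead of toward it.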




Figure 2.7 Backpropagation through a plant model


A  more  general  approach  than  dynamic  signs  is  for  one  network  to  act  as  a 
controller  while  a  second  network  learns  to  model  the  plant.  On  each  time  step,  the  second 
network  takes  the  current  state  and  control  actions  as  input,  and  tries  to  predict  what  the 
change in state will be, adjusting its parameters according to the error in its prediction. If
the  second  network  is  differentiable  everywhere,  which  is  the  case  in  networks  that  use 
Backpropagation,  then  when  it  learns  the  model,  it  will  also  know  all  of  the  partial 
derivatives  for  the  plant.  This  then  allows  errors  in  the  state  to  be  backpropagated  through 
both  networks  in  order  to  change  the  parameters  of  the  first  network  so  that  it  can  learn  to 
control  the  plant  model.  This  is  the  same  as  the  dynamic  signs  approach  described  above, 
except  that  the  partial  derivatives  across  the  plant  are  estimated  automatically  instead  of 
being  set  to  plus  or  minus  one  by  hand  according  to  a  priori  information. 

Figure  2.7  illustrates  this  process.  The  network  on  the  right  is  trained  to  predict 
what the next state of the plant will be, given the current state and control. This training is
indicated by the solid diagonal arrow through the network. At the same time, the network
on the left is trained to be a better controller. This is done by propagating the error in state
through both networks, while only changing weights in the controller network. Although
this signal propagates through the plant model network, it is not used to train that network,
which is why it is represented in the diagram by a dotted arrow. This approach has been
successfully used by Jordan [Jor88].
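The two-network arrangement can be illustrated with linear stand-ins for both networks. Everything specific here is an assumption: the plant, the reference model, the exploration noise, and the learning rates are hypothetical, and single weights take the place of multilayer networks. The model is trained on its prediction error. The controller is trained on the state error, which is passed back through the model's learned derivative rather than through a hand-set sign.

```python
import random

def plant(x, u):
    """Hypothetical true plant, unknown to both learners."""
    return 0.8 * x + 0.4 * u

def reference(x):
    """Hypothetical reference model giving the desired next state."""
    return 0.5 * x

rng = random.Random(0)
a_hat, b_hat = 0.0, 0.0   # plant model: x_next ~ a_hat*x + b_hat*u
w = 0.0                   # controller: u = w*x
for _ in range(20000):
    x = rng.uniform(-1.0, 1.0)
    u = w * x + rng.uniform(-0.2, 0.2)   # small exploration so the model sees varied inputs
    x_next = plant(x, u)

    # Train the plant model on its own prediction error.
    pred_err = x_next - (a_hat * x + b_hat * u)
    a_hat += 0.2 * pred_err * x
    b_hat += 0.2 * pred_err * u

    # Train the controller: the state error is propagated through the model
    # (d x_next / d u is estimated by b_hat) and then through du/dw = x.
    # The model weights are not changed by this signal.
    state_err = reference(x) - x_next
    w += 0.05 * state_err * b_hat * x
```

For this plant and reference the matching controller weight is -0.75, and the model should recover the gains 0.8 and 0.4.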

2.3.3  Optimizing  Specified  Reinforcement  Signals 

The  above  techniques  are  all  based  on  the  assumption  that  there  is  a  reference 
trajectory  to  follow.  At  each  time  step,  given  the  current  state,  it  is  assumed  that  the  desired 
change  in  state  is  known.  For  some  systems  though,  finding  a  reference  trajectory  is  fully 
as  difficult  as  finding  the  controller  in  the  first  place.  For  example,  a  large  semi  truck 
consists of two sections with a hinge between them. If the truck is near a loading dock and
at an angle to it, it can be difficult to calculate how to back up the truck so that it ends up
with the back end lined up with the dock [NW89]. This procedure may involve turning the
wheel all the way to the left, backing up some, then gradually turning it to the right, then
finally straightening it out, causing the truck to follow an "S" shaped path. If the path to
follow is known, it is trivial to calculate how to turn the wheel to follow the path, but
finding the correct path in the first place is a difficult problem. The model reference
systems discussed above are therefore not useful for solving this type of problem. In this
case,  the  goal  is  actually  to  minimize  a  quantity  after  a  certain  period  of  time  (the  distance 
from  the  dock  at  the  end),  rather  than  to  follow  a  given  trajectory. 

This  is  just  one  example  of  the  most  general  type  of  control  problem,  which  is  the 
optimization of some quantity over time. This is called "Reinforcement Learning" since the
goal of the controller is to maximize some external reinforcement signal over time [^^88].
Since several actions may be performed before the reinforcement is received, it is often
difficult to determine which of the actions were good and which were bad. This "temporal
credit assignment problem" makes reinforcement learning the most difficult type of problem
considered here. Control problems of this type include backing up a truck to minimize the
error at the end, finding the route to the moon that requires the least fuel, or finding the
actions for an animal that maximize the amount of food it finds. All of these cases involve
maximizing  a  reinforcement  (or  minimizing  a  cost)  over  some  period  of  time  (finite  or 
infinite).  This  is  a  difficult  problem,  since  it  may  be  necessary  to  perform  actions  that 
appear  "worse"  in  the  short  run,  but  are  "better"  in  the  long  run.  If  a  controller  generates 
some  action  and  then  receives  negative  reinforcement  (or  positive  cost),  it  is  not  clear 
whether  that  is  the  immediate  result  of  that  action  or  the  delayed  result  of  a  much  earlier 
action.  Thus  it  is  not  clear  how  to  learn  the  correct  action,  or  even  how  to  evaluate  a  given 
action. 

This  difficult  control  problem  has  been  addressed  by  Backpropagation  through 
time, actor-critic systems, and dynamic programming systems. Actor-critic systems and
dynamic programming systems tend to be categories with some overlap, but are a
useful way of classifying the many approaches to this type of problem.


Figure 2.8 Backpropagation through time: learning the plant model


One way to solve the reinforcement learning control problem is to extend the idea of
backpropagating through a plant model. Two networks are used. One is trained every time
step to learn to model the plant. Figure 2.8 shows how the network can learn to model the
plant using the current state, the previous state, and the previous control action.


Figure 2.9 Backpropagation through time: learning the controller

Once the first network has learned the plant model, the second network can learn to be a
controller based on that model. The two networks are connected as shown in Figure 2.9.
With all parameters fixed, the plant model network starts at some initial position, and the
controller network controls the model for a period of time that is known to be long enough
to drive the plant to the desired state. All of the signals going through the networks are
recorded during the trial.


Figure 2.10 The networks unrolled in time


The two networks are then "unrolled in time", so that it looks like the signals have
passed through a very long network once, instead of passing through two small networks
many times. The cost or reinforcement signals are calculated from the plant model state at
certain time steps of the "unrolled" network. In the case of the truck backer-upper example
[NW89], this signal is zero on every time step until the end, and then is equal to the error in
state  after  the  last  time  step.  This  error  can  be  backpropagated  through  the  large  "unrolled" 
network  to  change  all  of  its  parameters,  thus  forcing  the  controller  to  be  slightly  better 
throughout  the  whole  trial.  This  "Backpropagation  through  time"  procedure  has  been 
shown  to  be  able  to  solve  the  problem  of  backing  up  a  truck  [NW89].  It  is  related  to  ideas 
suggested  by  Werbos  [Wer89]  and  to  work  done  by  Jordan  [Jor88]  and  Jameson  [Jam90] 
where  signals  are  propagated  back  through  time  during  training. 
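The unrolling can be made concrete with a scalar sketch. It assumes a fixed linear plant model x[t+1] = a*x[t] + b*u[t], a one-parameter controller u = -k*x, and a cost equal to the squared state after the last step; these choices, and the numbers used, are hypothetical. The forward pass records every state, and the backward pass carries the final error back through each recorded step to accumulate the gradient for the controller parameter.

```python
def rollout(k, x0=1.0, a=1.0, b=0.5, n=5):
    """Forward pass: run controller and plant model for n steps, saving all states."""
    xs = [x0]
    for _ in range(n):
        u = -k * xs[-1]
        xs.append(a * xs[-1] + b * u)
    return xs

def cost(k):
    """Cost is zero on every step except the last: squared final state."""
    return rollout(k)[-1] ** 2

def bptt_grad(k, x0=1.0, a=1.0, b=0.5, n=5):
    """Backward pass through the unrolled network: d(cost)/dk."""
    xs = rollout(k, x0, a, b, n)
    lam = 2.0 * xs[-1]                 # d(cost)/d(x_N)
    grad = 0.0
    for t in reversed(range(n)):
        grad += lam * (-b * xs[t])     # direct effect of k on x[t+1]
        lam *= (a - b * k)             # sensitivity passed back to x[t]
    return grad

k = 0.5
initial_cost = cost(k)
for _ in range(100):
    k -= 0.5 * bptt_grad(k)            # make the controller slightly better each trial
final_cost = cost(k)
```

Note that the list of saved states grows with the length of the trial, which is the storage requirement discussed in the next paragraph.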

Backpropagation through time does have the difficulty, unfortunately, of requiring
that every signal on every time step during each trial be saved. For long trials this could be
a problem. Other algorithms could be used instead, such as the Williams-Zipser algorithm
for training recurrent networks [WZ89]. Its memory requirements are independent of the
length of the trial, but are proportional to the cube of the number of nodes, and its
processing per time step grows as the fourth power of the number of nodes (assuming fully
interconnected nodes), so that it can also be impractical for large networks.

Actor-Critic Systems

Backpropagation through time is potentially a very useful technique, but is still not
completely general. Even assuming the networks can perfectly model the functions they
are trained with, the result will still be a controller that causes the plant to follow a locally
optimal path. The path will be such that any small change to it will make it worse, although
a large change to the path as a whole might still improve it significantly. The
Backpropagation through time algorithm also requires storing all of the signals going
through the network throughout the whole trial. In a regulator problem, where the plant
may never fail and may never reach the goal state exactly, the trial will be infinitely long.
An alternative approach that avoids some of these difficulties is to use a system with two
components, called an "actor" and a "critic." The actor is the actual controller that, given
the state, decides which control actions should be used. The critic is a component that
receives external reinforcement signals and uses them to train the actor. This is still a
difficult problem, since reinforcement may come long after the actions that caused it. In
fact, the best actions may actually increase errors before they start to decrease them, and the
critic must recognize that this is the case. For example, with a cart-pole system, if the
cart starts at the origin with the pole balanced, and the goal is to move one meter to the
right, the reinforcement on each time step might be the negative of the position error. The
fastest way to move the system one meter to the right, without allowing the pole to fall over,
is to first move left, causing the pole to tilt to the right, and then move quickly to the right.
Thus the error in position should increase before it decreases. If the actor is to learn the
control actions that will accomplish this, the critic must first learn to recognize that this is
desirable. It will have to learn that a large position error with the pole tilted the correct way
is sometimes preferable to a smaller position error with the pole tilted the wrong way.

Samuel's checker player [Sam59] was one of the earliest systems to take this
approach. The actor was an algorithm that switched between book playing and an alpha-beta
tree search. The search was based on the relative desirability of various board
positions, as determined by the critic. The critic was a linear combination of several
hand-built heuristic functions, and learning for the critic consisted of adjusting the weights of the
linear combination, and also deciding which of a large number of heuristic functions should
be included in the combination.

Michie and Chambers [MC68] developed the Boxes system, which consisted of an
actor and a simple critic. They applied their controller to a cart-pole system that would
signal a failure whenever the pole fell over. The critic based its evaluation of a particular
state on the number of time steps between entering that state and failure. This system was
later improved by Barto, Sutton, and Anderson [BSA83] with the development of the
Associative Search Element (ASE) and Adaptive Critic Element (ACE). In that system, the
critic based its evaluation on both the time until failure and the change in evaluation over
time. Evaluations were therefore both predictions of the desirability of a given state, and
estimates of what the evaluations would be in future states. This system learned to balance
a pole on a cart more quickly than the Boxes system.
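The idea of a critic whose evaluations also predict later evaluations can be sketched with a simple temporal-difference update. This is not the ASE/ACE algorithm itself, which also uses eligibility traces and a stochastic actor; it is a reduced illustration on a hypothetical four-state chain that always ends in failure, with an assumed discount factor and learning rate.

```python
GAMMA = 0.9     # assumed discount factor
ALPHA = 0.1     # assumed learning rate
N_STATES = 4    # states 0..3; index 4 stands for the failure state

# evaluation[s] predicts the discounted future reinforcement from state s.
evaluation = [0.0] * (N_STATES + 1)   # the failure state keeps evaluation 0

for trial in range(2000):
    for s in range(N_STATES):
        s_next = s + 1
        # Reinforcement is -1 on the step that causes failure, 0 otherwise.
        r = -1.0 if s_next == N_STATES else 0.0
        # The error combines the external signal with the change in evaluation.
        td_error = r + GAMMA * evaluation[s_next] - evaluation[s]
        evaluation[s] += ALPHA * td_error
```

After training, states closer to failure have more negative evaluations (about -1 one step before failure, about -0.729 four steps before), so the critic signals approaching failure well before it occurs.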

Dynamic Programming Systems

Dynamic  programming  is  a  class  of  mathematical  techniques  for  solving 
optimization  problems.  Often  in  a  problem  the  sets  of  possible  states  and  actions  are  finite, 
or can be approximated as finite sets. The problem is to find the best control action for
each state, taking into account that it may be profitable to perform actions with low
reinforcement (or high cost) in one state in order to reach another state that gives high
reinforcement (or low cost). Not only is an action associated with each state, but typically
one or more other values are associated with each state as well.

The most common formulation of dynamic programming associates two values
with each state. A "policy" is the action that is currently considered to be the best for a
given state. An "evaluation" of a state is an estimate of the long-term reinforcement or cost
that will be experienced if the system starts in that state and performs optimal actions
thereafter. All policies and evaluations are initialized to some set of values, and then
individual values are improved in some order. A given policy or evaluation is improved by
setting it equal to the value that would be appropriate for it if the values of its neighbors
were correct. If this process is done repeatedly to policies and evaluations in all the
regions, then under certain circumstances it is guaranteed to converge to the optimal
solution [WB90]. The set of policies functions somewhat as an actor, while the set of
evaluations functions as a critic. Reinforcement learning with actor-critic systems may
therefore sometimes be thought of as a kind of dynamic programming.
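The repeated improvement of policies and evaluations can be sketched on a small problem. The five-state chain, its reinforcement values, and the discount factor below are hypothetical. Each sweep sets a state's evaluation and policy to what would be appropriate if the evaluations of its neighbors were correct.

```python
GOAL = 4                 # hypothetical chain of states 0..4; state 4 is the goal
ACTIONS = (-1, +1)       # move left or move right
GAMMA = 0.9              # assumed discount factor

def step(s, a):
    """Deterministic transition: small cost per move, reward on reaching the goal."""
    s_next = min(GOAL, max(0, s + a))
    r = 1.0 if s_next == GOAL else -0.05
    return s_next, r

evaluation = [0.0] * (GOAL + 1)    # the goal state keeps evaluation 0
policy = [ACTIONS[0]] * GOAL       # arbitrary initial policy

for sweep in range(50):
    for s in range(GOAL):
        returns = []
        for a in ACTIONS:
            s_next, r = step(s, a)
            returns.append(r + GAMMA * evaluation[s_next])
        best = max(range(len(ACTIONS)), key=lambda i: returns[i])
        policy[s] = ACTIONS[best]      # improve the policy for this state
        evaluation[s] = returns[best]  # improve the evaluation for this state
```

The resulting policy moves right in every state, accepting the small cost on each step in order to reach the rewarded goal.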

Other types of dynamic programming systems do not resemble actor-critic systems.
Q learning, devised by Watkins [Wat89], only involves one type of value. For each
possible action in each possible state, a number (the "Q value") is stored that represents the
expected long-term results if that action is performed in that state followed by optimal
actions thereafter. As in the other forms of dynamic programming, a Q value is updated by
changing it to be closer to the value that would be appropriate for it if the Q values of all its
neighbors are assumed to be correct. Q learning is also guaranteed under certain
assumptions to converge to the optimal solution.
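A tabular sketch of the Q value update follows. The chain problem, reinforcement values, discount factor, learning rate, and purely random exploration are all assumptions chosen for illustration; only the form of the update (moving a Q value toward the value implied by its neighbors) comes from the description above.

```python
import random

GOAL = 4                 # hypothetical chain of states 0..4; state 4 is the goal
ACTIONS = (-1, +1)       # move left or move right
GAMMA, ALPHA = 0.9, 0.5  # assumed discount factor and learning rate

def step(s, a):
    """Deterministic transition: small cost per move, reward on reaching the goal."""
    s_next = min(GOAL, max(0, s + a))
    r = 1.0 if s_next == GOAL else -0.05
    return s_next, r

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(GOAL + 1)]   # Q[state][action index]; goal row stays 0

for episode in range(2000):
    s = rng.randrange(GOAL)
    for t in range(100):
        i = rng.randrange(len(ACTIONS))      # random exploration
        s_next, r = step(s, ACTIONS[i])
        # Value that would be appropriate if the neighboring Q values were correct.
        target = r + GAMMA * max(Q[s_next])
        Q[s][i] += ALPHA * (target - Q[s][i])
        s = s_next
        if s == GOAL:
            break

greedy = [ACTIONS[max((0, 1), key=lambda i: Q[s][i])] for s in range(GOAL)]
```

No separate policy table is stored; the action to take is read off as the one with the largest Q value in each state.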

The above discussion assumed that the sets of possible states and actions were
finite. If there is a continuum of states and actions, then an approximation to dynamic
programming must be used. The most common approximation is to divide the state-space
into small regions, and store evaluation and policy values for each region. If the state-space
is high-dimensional, this will require prohibitively many values to be stored, and
dynamic programming is not feasible. A natural solution to this "curse of dimensionality"
is to use some form of function approximation system to store the evaluation and policy for
the continuum of states. Connectionist systems are a natural candidate for this use.
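A short calculation shows how quickly the table grows. The resolution of ten regions per dimension and the two example state dimensions are assumptions chosen only to illustrate the scaling.

```python
def table_entries(regions_per_dimension, dimensions, values_per_region=2):
    """Stored values for a gridded state-space: one evaluation and one policy per region."""
    return values_per_region * regions_per_dimension ** dimensions

four_state_plant = table_entries(10, 4)     # e.g. a plant with a 4-dimensional state
twelve_state_plant = table_entries(10, 12)  # e.g. a plant with a 12-dimensional state
```

Going from 4 to 12 state dimensions at the same resolution multiplies the storage by a factor of one hundred million, which is why a function approximator is attractive.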

This section has described systems for solving the problems of emulating a
specific controller, following a specified trajectory, and optimizing a specified signal.
None of the systems described here make use of much a priori knowledge of the plant.
Often, at least partial models of a plant exist, and it would be useful to have some method
of quickly incorporating this knowledge into the controller. The systems described here also
adapt very slowly to changes in the plant, since the network must learn a new
controller whenever the plant changes. These are problems that this thesis addresses.

3  HYBRID CONTROL ARCHITECTURE


The architecture presented here represents a new method of integrating learning and
adaptation in a synergistic arrangement, forming a single hybrid control system. The
adaptive portion of the controller provides real-time adaptation to time-varying dynamics
and disturbances, and initially accommodates any unknown dynamics. The learning
portion deals with static or very slowly changing spatial dependencies. The latter includes
any aspect of the plant dynamics that varies predictably with the current state of the plant or
the control action applied.

A  conventional  adaptive  control  system  reacts  to  discrepancies  between  the  desired 
and  observed  behaviors  of  the  plant  to  achieve  a  desired  closed-loop  system  performance. 
These  discrepancies  may  arise  from  time-varying  dynamics,  disturbances,  sensor  noise,  or 
unmodeled dynamics. The problem of sensor noise is usually addressed with filters, while
adaptive control itself is used to handle the remaining sources of observed discrepancies.
In practice, little can be done in advance for time-varying dynamics and disturbances; the
control system must simply wait for these to occur and then react. On the other hand,
unmodeled dynamics that are purely functions of state can be predicted from previous
experience. This is the task given the learning system. Initially, all unmodeled dynamics
are handled by the adaptive system; eventually, however, the learning system is able to
anticipate previously experienced unmodeled dynamics.¹ Thus, the adaptive system is
free to react to time-varying dynamics and disturbances, and is not burdened with the task
of reacting to predictable, yet initially unmodeled dynamics.

The hybrid adaptive / learning system presented in this thesis accommodates both
temporal and spatial modeling uncertainties. The adaptive part has a temporal emphasis; its
objective is to maintain the desired closed-loop behavior in the face of disturbances and
dynamics that are time-varying or appear to be time-varying (e.g., a change in behavior due
to a change in operating conditions). The learning part has a spatial emphasis; its objective
is to facilitate the development of the desired closed-loop behavior in the presence of
unmodeled nonlinearities within the operating envelope. Typically, the adaptive part has
relatively fast dynamics, while the learning part has relatively slow dynamics. The hybrid
approach allows each mechanism to focus on the part of the overall control problem for
which it is best suited, as summarized in Table 3.1.

¹ This assumes, of course, that the order of the plant (dimension of its state vector) is accurately known.


ADAPTATION                            LEARNING

reactive: maintain desired            constructional: synthesize desired
closed-loop behavior                  closed-loop behavior

temporal emphasis                     spatial emphasis

no memory => no anticipation          memory => anticipation

fast dynamics                         slow dynamics

local optimization                    global optimization

real-time adaptation                  design & on-line tuning
(time-varying dynamics)               (spatial dependencies)

Table 3.1. Adaptation vs. learning

A schematic of one possible realization of a hybrid adaptive / learning control
system is shown in Figure 3.1.

Figure 3.1 Hybrid adaptive / learning controller


To  simplify  the  discussion,  we  assume  that  all  necessary  plant  state  variables  are 
observable and measured; in the event that this is not the case, a state observer would have
to be used. The indirect adaptive controller outputs a control action based upon the current
state, the desired state, and the estimated behavior of the system being controlled. This
estimate characterizes the current dynamical behavior of the plant. If the behavior of the
plant changes, the estimator within the adaptive controller will update the model. If plant
changes are unpredictable, then the estimator will attempt to update the model as quickly as
possible, based on the information available in the (possibly noisy) sensor readings.
Adapting to predictable model errors that are functions of state will take just as long as
adapting  to  unpredictable  disturbances  and  temporal  changes,  assuming  similar  noise 
levels. 

The problem of accommodating predictable spatial dependencies is handled by the
learning system in the outer loop. It monitors the indirect adaptive controller's posterior
estimate of the plant parameters, and learns to associate the appropriate plant parameters
with each point in the state-space. The learning system can then anticipate plant behavior
based on past experience, and give its prediction to the indirect adaptive controller. This
allows the controller to accommodate predictable dynamics while still retaining the
capability to rapidly adapt to unpredictable dynamical effects.

The  learning  system  used  here  is  a  feedforward  multilayer  network.  The  input  is 
the  current  state,  and  the  output  is  a  prediction  of  what  the  plant  parameters  should  be, 
given  that  state.  Prior  to  learning,  the  network  is  initialized  so  that  the  hybrid  controller 
will  give  the  same  control  signals  that  the  adaptive  controller  would  give  by  itself.  When 
the  network  has  correctly  learned  the  mapping,  the  hybrid  adaptive  /  learning  controller  will 
anticipate  nonlinear  model  errors  that  are  functions  of  state  and  are  predictable,  and  will 
respond  faster  and  more  efficiently  than  a  simple  adaptive  controller  would.  If  these  spatial 
dependencies  change,  then  the  hybrid  will  act  as  an  adaptive  controller  until  it  learns  the 
new  mapping.  The  entire  system  is  automatic;  no  explicit  switching  mechanism  is  needed 
to  go  from  adaptation  to  learning. 
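The flow of information between the two parts can be sketched as follows. This is a loose illustration rather than the controller developed in this report: a lookup table stands in for the network, a single scalar stands in for the plant parameters, and the parameter function, noise level, correction gain, and all names are hypothetical. The learner supplies a prediction for the current state, the adaptive estimator corrects it from the latest measurement, and the learner is then trained toward that posterior estimate.

```python
import random

def true_parameter(x):
    """Hypothetical state-dependent plant parameter, unknown to the controller."""
    return 1.0 + x

class TableLearner:
    """Stand-in for the learning system: maps a binned state to a parameter value."""
    def __init__(self, lo=-1.0, hi=1.0, bins=20, rate=0.2):
        self.lo, self.hi, self.bins, self.rate = lo, hi, bins, rate
        self.values = [0.0] * bins
        self.seen = [False] * bins

    def _index(self, x):
        frac = (x - self.lo) / (self.hi - self.lo)
        return min(self.bins - 1, max(0, int(frac * self.bins)))

    def predict(self, x, fallback):
        """Before a region has been experienced, defer to the adaptive estimate."""
        i = self._index(x)
        return self.values[i] if self.seen[i] else fallback

    def train(self, x, posterior):
        i = self._index(x)
        if not self.seen[i]:
            self.values[i], self.seen[i] = posterior, True
        else:
            self.values[i] += self.rate * (posterior - self.values[i])

rng = random.Random(0)
learner = TableLearner()
estimate = 0.0
for _ in range(5000):
    x = rng.uniform(-1.0, 1.0)
    prior = learner.predict(x, fallback=estimate)        # anticipation from experience
    measured = true_parameter(x) + rng.gauss(0.0, 0.05)  # noisy information from sensors
    estimate = prior + 0.5 * (measured - prior)          # adaptive correction (posterior)
    learner.train(x, estimate)                           # learner follows the posterior

prediction = learner.predict(0.5, fallback=estimate)
```

Because the learner falls back on the adaptive estimate wherever it has no experience, no explicit switch between adaptation and learning is needed in this sketch.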

Note that any type of connectionist network could be used for the learning system;
it need only have the ability to learn functions from examples. This part of the system
could even be some other form of associative memory such as a lookup table or a nearest
neighbor classifier. In practice, though, these types of techniques may be impractical since
a potentially infinite number of example points are used for training, and the state-space
may have a high number of dimensions. For this reason, a connectionist network seems
more appropriate.

3.1  THE LEARNING COMPONENT

The hybrid architecture allows any learning system to be used that can learn to
approximate a function from a large set of examples of that function. The first learning
system examined here was a feedforward, Backpropagation, sigmoid network. The inputs
to the network and the outputs from the network were scaled to vary over a range of unit
width. The training examples were stored in a large buffer, and were presented to the
network in a random order. The network was trained incrementally; weights were changed
after each training example was presented.

The network was then tried with a different learning algorithm, a heuristic method
that approximates the effect of using the Hessian to scale the weight changes. A modified
version of this algorithm was also tried.

The  learning  systems,  and  the  reasons  behind  their  choice  for  this  application,  are 
described  in  further  detail  in  chapter  4. 

3.2 THE ADAPTIVE COMPONENT

The adaptive component of a hybrid controller can be any indirect adaptive
controller that can incorporate outside information. The controller might, for example,
estimate parameters of the plant, and then act as the best controller for those parameters. It
might instead estimate the amount of error in its predictions of the new state on the next
step, and try to compensate for it. For the experiments performed here, an indirect adaptive
controller of the latter type was used, both in its original form and with modifications.

A technique based on Time Delay Control (TDC) was chosen as the adaptive
system for the experiments presented here. TDC is an indirect adaptive control method
developed  by  Youcef-Toumi  and  Osamu  fYlSK)]. 

This system works by looking at the difference between the current state of the
plant and the state of the plant on the previous time step. This difference, along with
knowledge of what action was chosen on the previous time step, is used to estimate the
effect that the unmodeled dynamics are having on the system. This value h is calculated
explicitly and plays a pivotal role in the remaining calculations. A control action is then
generated to cancel the unwanted effects (modeled and estimated) and to induce the desired
behavior in the plant. The technique uses information that is only one time step old, so it is
able to react to sudden changes in the plant or environment after a single time step. Of
course, since it is in effect differentiating the state, it is sensitive to high frequency noise.
Youcef-Toumi points out that this is not as bad as it seems if the plant itself acts as a
low-pass filter, attenuating the effect of the noise in the control actions.

The controller can also be made less sensitive to high frequency noise by simply
using a larger time step and a filter. This, however, causes it to react more slowly to
changes in the plant. Overall, TDC does a good job, but it cannot both react quickly and
remain insensitive to high frequency noise.

3.3 THE HYBRID SYSTEM

The  connectionist  network  used  in  the  hybrid  adaptive  /  learning  controller  is  a 
simple,  feedforward,  back-propagation  network,  with  two  hidden  layers  of  ten  nodes  each. 
Given  the  state  and  goal  for  the  plant,  the  network  could  be  trained  to  output  an  estimate  of 
the  unmodeled  dynamics  h.  In  the  absence  of  noise,  this  should  be  the  same  h  that  TDC 
calculates.  If  noise  is  present,  it  may  be  possible  to  determine  the  current  state  of  the  plant 
to within a small error. The correct h, however, is difficult to calculate precisely, because
it  is  found  by  "differentiating"  the  state  (e.g.,  using  a  backwards  difference). 

One property of connectionist networks is useful here. During training, a network
is given input and desired output values repeatedly. If it is given conflicting desired
outputs for the same input, then it tends to average them. This means that the network can
be trained with data that has small, zero mean noise and still learn the correct mapping.
Therefore, if TDC calculates noisy h's with an equal probability of the value being too high
or too low for a given state, and if these are used to train the network, then the network will
tend to learn the correct h for each state.
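
The averaging property can be illustrated with a deliberately tiny stand-in for the network. The sketch below is not the report's network: it assumes a single adjustable number trained incrementally on noisy targets for one fixed state, and the names `train_scalar_estimate`, `true_h`, and the noise range are hypothetical choices made only for illustration.

```python
import random


def train_scalar_estimate(targets, rate=0.01, start=0.0):
    """Incrementally fit one number to a stream of noisy targets.

    Each update is a gradient step on the squared error of a single example,
    so conflicting targets for the same input are averaged over time.
    """
    estimate = start
    for target in targets:
        estimate += rate * (target - estimate)
    return estimate


rng = random.Random(0)
true_h = 0.7  # hypothetical noise-free value of h for one state
# Zero-mean noise: equally likely to be too high or too low.
noisy_h = [true_h + rng.uniform(-0.2, 0.2) for _ in range(5000)]
learned_h = train_scalar_estimate(noisy_h)
```

With a small learning rate, `learned_h` settles close to `true_h` even though no single training target equals it.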

Moreover, a learning system is not only useful when h is noisy; it is also helpful
when it is used to predict h. In its original form, TDC looks at the state of the plant before
and after a given time step. Back-differencing to estimate the derivative, TDC can then
calculate the unmodeled dynamics h during that period. That h is then used to calculate the
appropriate control action to be applied to the plant during the next time step. This is a
source of error in the controller, since it is always sending out control actions based on
what was correct on the previous time period. With the network, there is a simple solution
to this problem. Instead of associating h with the current state during training, it is
associated with the previous state. After the network has been trained with those patterns,
it should be able to predict, given a state, what h will be during the time step following
that state. This allows a better estimate to be calculated.
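
The re-association of h with the previous state amounts to shifting the training targets by one time step. A minimal sketch follows, assuming the logged states and TDC estimates are kept in two parallel lists; the function name `prediction_pairs` and the list layout are hypothetical.

```python
def prediction_pairs(states, h_values):
    """Pair each state with the h measured over the step that follows it.

    states[k] is x(k). h_values[k] is the h that TDC computes at time k,
    which describes the interval from k-1 to k, so it is attached to x(k-1).
    """
    return [(states[k - 1], h_values[k]) for k in range(1, len(states))]


# h_values[0] is unused because no earlier state exists to attach it to.
pairs = prediction_pairs([0.0, 1.0, 2.0], [None, 10.0, 20.0])
```

A network trained on these pairs is asked for h at the current state and returns an estimate for the upcoming step rather than the one just completed.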

The hybrid controller, therefore, has at least the potential to solve both of the
difficulties associated with the original adaptive controller. This is in addition to the main
problem it was designed to solve: learning control. These considerations provide
motivation for experimenting with the hybrid controller.

The hybrid adaptive / learning controller typically runs at a speed such that the states
and h do not change much over a period of several time steps. If the network is trained on
similar states several times in a row, it may "forget" what it knows about other states. One
solution might be to train the network less frequently, such as once a second. This might
be effective, but it would slow down learning by not learning every time step. A better
solution is to use a random buffer. During training, as the plant wanders through the
state-space, the data from each time step is stored in the buffer. One point is also chosen at
random from the buffer on each time step, and is used to train the network. This ensures
that the network is trained on a distributed set of points.
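
A random buffer of this kind can be sketched in a few lines. The report does not say what happens when the buffer fills, so overwriting the oldest entry is an assumption here, as are the class name `RandomBuffer` and its method names.

```python
import random


class RandomBuffer:
    """Fixed-capacity store of training points with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.points = []
        self.next_slot = 0
        self.rng = random.Random(seed)

    def store(self, state, h):
        """Save one (state, h) pair, overwriting the oldest once full (assumed policy)."""
        point = (state, h)
        if len(self.points) < self.capacity:
            self.points.append(point)
        else:
            self.points[self.next_slot] = point
        self.next_slot = (self.next_slot + 1) % self.capacity

    def sample(self):
        """Return one stored point chosen uniformly at random."""
        return self.rng.choice(self.points)


buffer = RandomBuffer(capacity=3, seed=1)
for k in range(5):
    buffer.store(state=float(k), h=2.0 * k)  # one new point per time step
drawn = buffer.sample()                      # one training point per time step
```

On every time step the controller would call `store` once and `sample` once, so the network still learns at every step but sees points scattered over the visited region.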

3.4 DERIVATION OF THE HYBRID WITH KNOWN CONTROL EFFECT

The original TDC equations were designed to allow the incorporation of a priori
information consisting of a linear model of the plant. The effect of control action on state
was assumed to be known perfectly, but the other parameters could initially be incorrect.
The following is a derivation of the TDC equations for a discrete time plant where the
known dynamics are described by the a priori matrices $\Phi$ and $\Gamma$, as well as the
knowledge gained by the learning system $\Psi$. As in the original TDC, the effect of control
on state is assumed to be a linear function, and the constant $\Gamma$ is assumed to be known
without any error.

Assume  that  the  plant  being  controlled  is  of  the  form 

$$x(k+1) = \Phi x(k) + \Gamma u(k) + \Psi(x(k)) + h(x(k),k) \qquad (1)$$

where at time k, x is the state vector, u is the control vector, and all of the unknown
dynamics are represented by the function h. The vector $\Psi$ is the output of the learning
system.

The  reference  system  has  the  dynamics 

$$x_m(k+1) = \Phi_m x_m(k) + \Gamma_m r(k) \qquad (2)$$

where  r  is  the  command  vector.  The  error  between  the  actual  state  and  the  reference  state  is 

$$e(k) = x_m(k) - x(k) \qquad (3)$$

The goal is to build a controller that will cause the error to behave as:

$$e(k+1) = \{\Phi_m + K\}\, e(k) \qquad (4)$$

where K is the error feedback matrix which allows the error dynamics to be specified
independently of the values of the other parameters.

Substituting (3) into the left side of (4), then substituting (2) into the result and
solving for x(k+1) gives the desired next state:

$$x_m(k+1) - x(k+1) = \{\Phi_m + K\}\, e(k)$$

$$\Phi_m x_m(k) + \Gamma_m r(k) - x(k+1) = \{\Phi_m + K\}\, e(k)$$

$$x(k+1) = \Phi_m x_m(k) + \Gamma_m r(k) - \{\Phi_m + K\}\, e(k) \qquad (5)$$

Setting (1) and (5) equal and solving for u gives the control law that should be
followed in order to achieve the desired next state.

$$\Phi x(k) + \Gamma u(k) + \Psi(x(k)) + h(x(k),k) = \Phi_m x_m(k) + \Gamma_m r(k) - \{\Phi_m + K\}\, e(k) \qquad (6)$$

$$u(k) = \Gamma^{+} \{ \Phi_m x_m(k) + \Gamma_m r(k) - \{\Phi_m + K\}\, e(k) - \Phi x(k) - \Psi(x(k)) - h(x(k),k) \} \qquad (7)$$

where, for a matrix M, $M^{+}$ is the pseudo-inverse of M. The only unknown
in (7) is h. If h changes slowly, then it can be approximated by its previous value.
Solving (1) for h and then applying this approximation yields:

$$h(x(k),k) = x(k+1) - \Phi x(k) - \Gamma u(k) - \Psi(x(k)) \qquad (8)$$

$$h(x(k),k) \approx x(k) - \Phi x(k-1) - \Gamma u(k-1) - \Psi(x(k-1)) \qquad (9)$$

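Substituting the approximation (9) into equation (7), and using
$\Phi_m x_m(k) - \Phi_m e(k) = \Phi_m x(k)$ from (3), gives the final control law

$$\begin{aligned} u(k) = \Gamma^{+} \{ & \Phi_m x(k) + \Gamma_m r(k) - K e(k) - \Phi x(k) - \Psi(x(k)) \\ & - x(k) + \Phi x(k-1) + \Gamma u(k-1) + \Psi(x(k-1)) \} \end{aligned} \qquad (10)$$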

The controller will adapt to a sudden change in the plant dynamics within one time
step. If the time step is short, the controller will respond faster, but will also be more
sensitive to noise.
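
The one-step adaptation can be checked on a scalar plant. The sketch below is only an illustration of control law (10) under stated assumptions: all quantities are scalars, so division by `gamma` stands in for the pseudo-inverse; the learning system is untrained (`psi` returns zero); and every numerical value, along with the function names `tdc_hybrid_control` and `simulate`, is hypothetical.

```python
def tdc_hybrid_control(x, x_prev, u_prev, x_ref, r, psi,
                       phi, gamma, phi_m, gamma_m, K):
    """Scalar form of control law (10); 1/gamma plays the role of the pseudo-inverse."""
    e = x_ref - x                                               # eq (3)
    h_est = x - phi * x_prev - gamma * u_prev - psi(x_prev)     # eq (9)
    x_next_desired = phi_m * x + gamma_m * r - K * e            # eq (5), simplified
    return (x_next_desired - phi * x - psi(x) - h_est) / gamma  # eq (7)


def simulate(steps=40):
    """Track a reference model while an unknown offset h jumps at k = 20."""
    phi, gamma = 0.9, 1.0               # a priori plant model (hypothetical values)
    phi_m, gamma_m, K = 0.5, 0.5, 0.0   # reference model and error feedback
    psi = lambda x: 0.0                 # untrained learning system (assumption)
    x = x_prev = u_prev = x_ref = 0.0
    r = 1.0
    errors = []
    for k in range(steps):
        u = tdc_hybrid_control(x, x_prev, u_prev, x_ref, r, psi,
                               phi, gamma, phi_m, gamma_m, K)
        h_true = 0.3 if k < 20 else -0.2                   # unmodeled dynamics
        x_next = phi * x + gamma * u + psi(x) + h_true     # eq (1)
        x_ref_next = phi_m * x_ref + gamma_m * r           # eq (2)
        x_prev, x, u_prev, x_ref = x, x_next, u, x_ref_next
        errors.append(x_ref - x)
    return errors


errors = simulate()
```

In this noise-free setting the error is disturbed only on the step where h jumps; from the following step on, the estimate of h is exact again and the error decays according to (4).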

3.5 DERIVATION OF THE HYBRID WITH UNKNOWN CONTROL EFFECT

It is often the case that the exact effect of control action on state is only partially
known, just as the dynamics of the state are only partially known. If a learning system can
learn the unmodeled dynamics, then the partial derivative of the learned function $\Psi$ with
respect to control action u will represent the unmodeled effect of control on state, and can
be used to improve the a priori estimate of this value $\Gamma$. The following is a derivation of
the hybrid system, incorporating these partial derivatives as an improvement over the
approach in section 3.4.

Assume that a plant has the following dynamics:

$$x(k+1) = \Phi x(k) + \Gamma u(k) + \Psi(x(k), u(k)) + h(x(k), u(k), k) \qquad (11)$$

where the vector x(k) is the state at time k, the vector u is the control, the matrices $\Phi$ and
$\Gamma$ and the function $\Psi$ are the a priori known and learned dynamics, and the function h
represents all of the unknowns, including unmodeled dynamics, nonlinearities as a function
of state or control action, and time-varying disturbances.

Once  again,  the  reference  system  has  the  dynamics 

$$x_m(k+1) = \Phi_m x_m(k) + \Gamma_m r(k) \qquad (12)$$

where r is the command vector giving the state to which the plant should be driven. The
error between the actual state and the reference state is

$$e(k) = x_m(k) - x(k) \qquad (13)$$

and the goal is to build a controller that will cause the error to decrease according to:

$$e(k+1) = \{\Phi_m + K\}\, e(k) \qquad (14)$$

Substituting (13) into the left side of (14), substituting (12) into the result of that,
and then solving for x(k+1) gives the desired state on the next time step.

$$x(k+1) = \Phi_m x_m(k) + \Gamma_m r(k) - \{\Phi_m + K\}\, e(k) \qquad (15)$$

All known dynamics not defined by $\Phi$ and $\Gamma$ are represented by the function $\Psi$.
This can be learned or stored in any manner that allows the calculation of the partial
derivatives with respect to u. When calculating the u for a given time step, it will be
necessary to take into account the fact that $\Psi$ may affect the next state differently according
to which u is chosen. Figure 3.2 illustrates how $\Psi$ can be approximated by evaluating it at
the current state and previous control action, then forming a line through that point with the
appropriate slope in the u direction. Equation (16) shows this approximation
mathematically.


Figure 3.2 Approximation of $\Psi$ as a (linear) function of u(k) and nonlinear function of x

$$\Psi(x(k), u(k)) \approx \Psi(x(k), u(k-1)) + \left.\frac{\partial \Psi}{\partial u}\right|_{x(k),\,u(k-1)} \{u(k) - u(k-1)\} \qquad (16)$$


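Substituting (16) into (11) gives a more useful formulation of the plant dynamics.

$$\begin{aligned} x(k+1) = {} & \Phi x(k) + \Gamma u(k) + \Psi(x(k), u(k-1)) \\ & + \left.\frac{\partial \Psi}{\partial u}\right|_{x(k),\,u(k-1)} \{u(k) - u(k-1)\} + h(x(k), u(k), k) \end{aligned} \qquad (17)$$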


If the function h representing disturbances, etc. is changing slowly, then it can be
approximated by solving for h in (17) for the previous time step, and using that as the
approximation of h for the current time step:

$$h(x(k), u(k), k) \approx h(x(k-1), u(k-1), k-1)$$

$$\begin{aligned} h(x(k), u(k), k) \approx {} & x(k) - \Phi x(k-1) - \Gamma u(k-1) - \Psi(x(k-1), u(k-2)) \\ & - \left.\frac{\partial \Psi}{\partial u}\right|_{x(k-1),\,u(k-2)} \{u(k-1) - u(k-2)\} \end{aligned} \qquad (18)$$


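Substituting (18) into (17) and solving for u(k) gives the control law in terms of the
desired next state x(k+1).

$$\begin{aligned} u(k) = \left[ \Gamma + \left.\frac{\partial \Psi}{\partial u}\right|_{x(k),\,u(k-1)} \right]^{+} \Bigg\{ & \left.\frac{\partial \Psi}{\partial u}\right|_{x(k),\,u(k-1)} u(k-1) - \Phi x(k) - \Psi(x(k), u(k-1)) + x(k+1) \\ & - x(k) + \Phi x(k-1) + \Gamma u(k-1) + \Psi(x(k-1), u(k-2)) \\ & + \left.\frac{\partial \Psi}{\partial u}\right|_{x(k-1),\,u(k-2)} \{u(k-1) - u(k-2)\} \Bigg\} \end{aligned} \qquad (19)$$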


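Substituting the desired next state (15) into (19) yields the final control law:

$$\begin{aligned} u(k) = \left[ \Gamma + \left.\frac{\partial \Psi}{\partial u}\right|_{x(k),\,u(k-1)} \right]^{+} \Bigg\{ & \left.\frac{\partial \Psi}{\partial u}\right|_{x(k),\,u(k-1)} u(k-1) - \Phi x(k) - \Psi(x(k), u(k-1)) \\ & + \Phi_m x(k) + \Gamma_m r(k) - K e(k) \\ & - x(k) + \Phi x(k-1) + \Gamma u(k-1) + \Psi(x(k-1), u(k-2)) \\ & + \left.\frac{\partial \Psi}{\partial u}\right|_{x(k-1),\,u(k-2)} \{u(k-1) - u(k-2)\} \Bigg\} \end{aligned} \qquad (20)$$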
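
A scalar sketch of control law (20) may help in checking the signs term by term. It rests on several assumptions that are not in the report: all quantities are scalars, the learned function `psi` is taken to be exact and linear in u (so the linearization (16) introduces no error), the unknown h is constant, and all names and numerical values are hypothetical.

```python
def hybrid_control(x, x_prev, u_prev, u_prev2, x_ref, r, psi, dpsi_du,
                   phi, gamma, phi_m, gamma_m, K):
    """Scalar form of control law (20)."""
    e = x_ref - x                                         # eq (13)
    slope_now = dpsi_du(x, u_prev)        # dPsi/du at (x(k), u(k-1))
    slope_old = dpsi_du(x_prev, u_prev2)  # dPsi/du at (x(k-1), u(k-2))
    h_est = (x - phi * x_prev - gamma * u_prev - psi(x_prev, u_prev2)
             - slope_old * (u_prev - u_prev2))            # eq (18)
    x_next_desired = phi_m * x + gamma_m * r - K * e      # eq (15), simplified
    bracket = (slope_now * u_prev - phi * x - psi(x, u_prev)
               + x_next_desired - h_est)
    return bracket / (gamma + slope_now)                  # eq (19)


def simulate(steps=30):
    """Track the reference model with an extra, learned control effect in psi."""
    phi, gamma = 0.9, 1.0                      # a priori model (hypothetical values)
    phi_m, gamma_m, K = 0.5, 0.5, 0.0          # reference model and error feedback
    psi = lambda x, u: 0.1 * x * x + 0.2 * u   # hypothetical learned dynamics
    dpsi_du = lambda x, u: 0.2                 # its partial derivative in u
    h_true = 0.3                               # constant unknown dynamics
    x = x_prev = u_prev = u_prev2 = x_ref = 0.0
    r = 1.0
    errors = []
    for _ in range(steps):
        u = hybrid_control(x, x_prev, u_prev, u_prev2, x_ref, r, psi, dpsi_du,
                           phi, gamma, phi_m, gamma_m, K)
        x_next = phi * x + gamma * u + psi(x, u) + h_true   # eq (11)
        x_ref_next = phi_m * x_ref + gamma_m * r            # eq (12)
        x_prev, x, u_prev2, u_prev, x_ref = x, x_next, u_prev, u, x_ref_next
        errors.append(x_ref - x)
    return errors


errors = simulate()
```

Under these assumptions the first step carries the full effect of the unknown h, after which the error follows (14) and halves on every step.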

4 LEARNING SYSTEMS USED


The learning component of the hybrid control system is responsible for learning the
function that the adaptive component discovers a posteriori. Because the function is
defined over a continuum of states, and can involve a number of dimensions, connectionist
systems were chosen for the learning component. First a standard Backpropagation
network was used, as described in the next section, then Delta-Bar-Delta learning was
tried, as described in the following section, to increase learning speed.

4.1 BACKPROPAGATION NETWORKS

During  the  operation  of  an  indirect  adaptive  controller,  certain  parameters  are 
estimated  on  each  time  step,  and  the  controller  uses  these  to  choose  an  appropriate  control 
action.  Either  on  the  next  time  step,  or  soon  thereafter,  the  controller  may  have  additional 
information  about  what  the  estimates  should  have  been  earlier.  It  is  natural  to  consider 
whether  a  learning  system  of  some  sort  could  learn  to  map  the  earlier  state  to  later, 
improved  estimates,  and  so  be  able  to  make  even  better  estimates  the  next  time  that  state  is 
entered.  This  is  simply  a  delayed  function  approximation  problem. 

The function being learned would output parameters as a function of state. The
parameters and the state may be high-dimensional vectors, and the function being learned
may need to be developed on the basis of a large number of training points generated by the
indirect adaptive controller. In this case, a Backpropagation network would seem to be a
good model for learning the functions involved. For any given function and desired
accuracy, a network can be found that will learn that function to the desired accuracy
[HW89]. This is true for networks built from any of a wide range of functions.

There are a number of considerations that arise when trying to apply
Backpropagation  networks  to  learning  functions  in  this  context.  First,  the  data  used  to  train 
the  network  comes  from  a  system  controlling  an  actual  plant.  In  this  case,  the  training  data 
consists  of  states  and  the  appropriate  parameters  that  should  be  associated  with  them.  The 
state  used  for  training  will  always  be  a  recent  state  of  the  plant,  and  since  the  state  of  a  plant 
may  not  change  much  on  each  time  step,  the  training  data  during  a  given  period  of  time  will 
all  tend  to  come  from  one  region  of  the  state-space.  This  is  even  more  applicable  in  the 
case  of  a  regulator,  where  the  controller  tries  to  keep  the  plant  near  a  particular  state  all  the 
time.  If  the  controller  is  doing  a  good  job  and  there  are  no  large  disturbances,  the  state  of 
the  plant  will  stay  near  where  it  should  be.  This  means  that  no  training  points  will  be 
generated  in  other  regions.  Even  in  a  tracking  control  problem,  the  plant  may  still  move 
slowly  through  state-space.  Therefore,  it  is  important  to  consider  the  ability  of  a  given 
learning  system  to  learn  despite  repeated  exposure  to  very  similar  training  patterns  for  long 
periods  of  time. 

Backpropagation,  and  most  of  its  variants,  all  try  to  adjust  the  weights  in  the 
direction  of  some  gradient  and  decrease  error,  as  described  in  chapter  2.  The  error  being 
minimized  J  is  frequently  defined  as  the  mean  squared  error  between  the  network  output 
and the desired value of its output, summed over all possible inputs:

$$J = \sum_{i=1}^{n} \left( f(x_i, w) - d_i \right)^T \left( f(x_i, w) - d_i \right)$$

$$\Delta w_j = -\alpha \frac{\partial J}{\partial w_j}$$

where:

J = total error for network with weight vector w
n = number of training examples
x_i = input to network for ith training example
d_i = desired output of network for ith training example
f(x_i, w) = actual output of network for ith training example

This implies that the network is updated by epoch learning, where weights are changed
once per epoch (once per pass through all the training examples). However, for the
function approximation being done here, the function being learned is continuous. Even if
x is only a two element vector, the error becomes

$$J = \int_{x_1} \int_{x_2} \left( f(x, w) - d(x) \right)^T \left( f(x, w) - d(x) \right) \, dx_1 \, dx_2$$


This requires summing over an infinite number of training examples, which takes
infinite time, just to find the error associated with a single set of weights. The common
approximation in this case is to use incremental learning. In incremental learning, the
weights are adjusted a small amount after each presentation of a training example. The
change is made in the direction of the negative gradient of the error associated with only
that one example. If the weights change only slightly over the time it takes to see all of the
inputs, then incremental learning will tend to give the same answer that epoch learning
would.
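
The relationship between the two update schemes can be seen on a toy problem. The sketch below assumes a one-weight linear model f(x, w) = w x and a small, noise-free training set generated from d = 2x; these choices, and the function names, are hypothetical and far simpler than the networks used in the report.

```python
import random


def epoch_step(w, examples, rate):
    """Epoch learning: sum the gradient over every example, then step once."""
    grad = sum(2.0 * (w * x - d) * x for x, d in examples)
    return w - rate * grad


def incremental_pass(w, examples, rate, rng):
    """Incremental learning: step after each example, presented in random order."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    for x, d in shuffled:
        w -= rate * 2.0 * (w * x - d) * x
    return w


examples = [(i / 10.0, 2.0 * i / 10.0) for i in range(-10, 11)]  # d = 2x
rng = random.Random(0)
w_epoch = w_incremental = 0.0
for _ in range(200):
    w_epoch = epoch_step(w_epoch, examples, rate=0.01)
    w_incremental = incremental_pass(w_incremental, examples, rate=0.01, rng=rng)
```

With a small step size both weights approach 2, even though the incremental version never computes the total error.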

Suppose, however, that increasing a given weight would increase the error for one
third of the training examples and decrease it an equal amount for two thirds of the training
examples; in this case, the correct action would be to increase that weight. If training
examples are presented in a random order, then on each presentation, there will be a one
third probability that the weight will decrease and a two thirds probability that it will
increase. In the long run, the weight takes a random walk that tends to increase it as it
should. If, however, many training points are presented in a row that all have similar
inputs and outputs, then their partial derivatives will tend to be similar, and they will all
tend to move the weight in the same direction. The net effect of this is to cause the network
to learn the function in that region extremely well, at the expense of forgetting any
information it had already learned about other regions. This phenomenon is referred to
here as fixation. One simple method to avoid fixation is to use a buffer to hold many of
the training points. Then on each time step a training point can be drawn at random, and
used to train the network. This scrambling of the training points helps avoid fixation, but it
may require a large memory to hold all of the data.

Another  characteristic  of  Backpropagation  is  that  it  tends  to  learn  slowly.  There  are 
a  number  of  reasons  for  this,  some  of  which  are  clearer  when  the  learning  problem  is 
visualized  geometrically.  The  connectionist  network  contains  a  finite  number  of  real-valued 
weights.  This  weight  vector  determines  the  behavior  of  the  network,  and  so  the  error  is  a 
function  of  this  weight  vector.  The  error  can  be  visualized  as  a  multi-dimensional  surface 
(or  manifold)  in  a  space  with  one  more  dimension  than  the  number  of  components  of  the 
weight  vector.  A  given  weight  vector  corresponds  to  a  single  point  on  this  error  surface. 
The  height  of  the  error  surface  corresponds  to  the  mean  squared  error  associated  with  that 
vector.  If  there  is  only  one  training  point,  there  will  be  an  error  surface  associated  with  it. 
If  there  are  several  training  points,  then  there  is  an  error  surface  associated  with  each  of 
them,  and  the  sum  of  all  those  functions  gives  the  total  error  surface.  When  a  given 
training  point  is  presented  to  the  network,  it  is  possible  to  find  the  partial  derivative  of  the 
error  for  that  point  with  respect  to  each  weight.  The  negative  of  this  gradient  corresponds 
to  the  direction  of  steepest  descent  for  the  individual  error  surface  associated  with  that 
training  example.  The  sum  of  all  the  individual  gradients  gives  the  gradient  for  the  total 
error  surface. 

The  goal  of  learning,  then,  is  to  follow  the  gradient  of  the  total  error  surface, 
changing  the  weights  so  as  to  move  downhill  to  a  local  minimum  in  that  surface.  If  a 
certain  region  of  that  surface  is  shaped  like  a  trough,  then  repeated  steps  in  the  direction  of 
the  gradient  will  tend  to  cause  the  weight  vector  to  oscillate  across  the  bottom  of  the  trough, 
and  not  move  very  fast  in  the  direction  of  the  gentle  slope  along  the  trough.  If  large  steps 
are  taken,  then  it  is  possible  to  leave  the  trough  entirely,  perhaps  then  reaching  an 
undesirable plateau. If small steps are taken, then the weight vector will take reasonable
steps across the trough, but will move too slowly along the trough. Such troughs may
therefore slow down convergence of gradient descent, and so slow the learning process in
a Backpropagation network.

Not only do troughs slow down learning, but they are also very common and easily
formed. Consider a surface that has a number of roughly circular depressions. If the
surface is stretched a hundredfold along one axis, there will then be a large number of
troughs parallel to that axis. In the error surface for a network, each weight is one axis.
Therefore simply multiplying a weight by a large constant (and backpropagating through
that constant appropriately) can create troughs in weight space. Similarly, if one of the
inputs to a network varies over a much wider range than another, troughs will tend to form.
To avoid this scaling problem, all experiments for this thesis were carried out with all
inputs and outputs to and from networks scaled to vary over a range of unit width.
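
Scaling to a range of unit width is a one-line affine map. The report does not say where the unit-width range is centered, so the interval [-0.5, 0.5] used below is an assumption, as are the function names.

```python
def scale_to_unit_width(value, low, high):
    """Map value from [low, high] onto [-0.5, 0.5], a range of unit width."""
    return (value - low) / (high - low) - 0.5


def unscale_from_unit_width(scaled, low, high):
    """Invert scale_to_unit_width, recovering the original units."""
    return (scaled + 0.5) * (high - low) + low
```

Network inputs would pass through the first function, and network outputs through the second, using the expected minimum and maximum of each signal as `low` and `high`.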

An  obvious  solution  to  the  problem  of  troughs  would  be  to  look  at  both  the  first  and 
second  derivative  for  the  current  weight  vector.  Instead  of  simply  calculating  the  gradient 
of  the  error  surface  at  a  point,  the  curvature  at  that  point  could  also  be  calculated.  Since  the 
gradient  changes  rapidly  across  the  trough,  the  curvature  in  that  direction  would  be  large, 
and  small  steps  in  that  direction  would  be  appropriate.  Since  the  gradient  changes  slowly 
along  the  trough,  the  curvature  is  low  in  that  direction,  and  it  would  be  safe  to  take  larger 
steps  in  that  direction.  Thus,  if  the  step  size  in  each  direction  is  decreased  in  proportion  to 
the curvature in that direction, then the modified gradient descent will tend to head more
directly  towards  the  local  minimum,  and  can  reach  it  in  less  time  with  fewer  oscillations.  If 
the  trough  is  actually  a  very  long,  thin  ellipsoid  (i.e.,  a  perfect  quadratic  function),  then 
dividing  by  the  second  derivative  would  allow  the  local  minimum  to  be  reached  in  a  single 
step. 
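
The effect of dividing by curvature can be shown on a perfectly quadratic trough. The sketch assumes a two-weight error surface J(w) = 0.5 (100 w1^2 + w2^2), whose axes are aligned with the weights so that the curvature in each direction is just a constant; the surface, the step size, and the function names are hypothetical.

```python
CURVATURE = (100.0, 1.0)  # steep across the trough, gentle along it


def gradient(w):
    """Gradient of J(w) = 0.5 * sum(c_i * w_i**2)."""
    return tuple(c * wi for c, wi in zip(CURVATURE, w))


def gradient_step(w, rate):
    """Plain gradient descent: the same step size in every direction."""
    return tuple(wi - rate * gi for wi, gi in zip(w, gradient(w)))


def curvature_scaled_step(w):
    """Divide each gradient component by the curvature in that direction."""
    return tuple(wi - gi / c for wi, gi, c in zip(w, gradient(w), CURVATURE))


start = (1.0, 1.0)
w = start
across = []
for _ in range(50):
    w = gradient_step(w, rate=0.019)
    across.append(w[0])            # position across the trough

newton = curvature_scaled_step(start)
```

Plain gradient descent flips sign across the trough on every step and, after 50 steps, has covered well under two thirds of the distance along it, while the curvature-scaled step lands on the minimum at once.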

Figure 4.1 illustrates a trough with a dot representing the current weight vector.
The arrow pointing to the right is the negative gradient, which points mainly across the
trough and only slightly along the trough. Taking discrete steps along this gradient can
cause oscillation, and could even leave the trough entirely if the steps are too large. The
arrow pointing to the left is the negative gradient divided by the curvature of the surface. It
points directly toward the local minimum (for this ellipsoidal trough), and is a better path to
follow for fast convergence.


Figure 4.1 A weight vector on the side of a trough, showing
the negative gradient (right arrow) and negative gradient divided by curvature (left arrow)

For a multi-dimensional surface, the slope is a vector of first derivatives (the
gradient) and the curvature is a matrix of second derivatives (the Hessian). If there are N
weights, then the Hessian will be an N by N matrix, and its eigenvectors will point in the
directions of maximum curvature. The eigenvalues correspond to the curvature in those
directions. If it was useful to multiply the step size in a direction by the curvature in that
direction, then the gradient could simply be multiplied by the Hessian. Unfortunately, the
desired operation is to divide the step size by the curvature. This is equivalent to
multiplying the gradient by the inverse of the Hessian:

$$\Delta w = -G H^{-1}$$

$$G_i = \frac{\partial J}{\partial w_i} = \text{gradient}$$

$$H_{ij} = \frac{\partial^2 J}{\partial w_i \, \partial w_j} = \text{Hessian}$$

$$J = \text{total error}$$

This involves inverting an N by N matrix on each step! This procedure may be
computationally expensive, so numerous approximations and heuristics have been
proposed to accomplish nearly the same thing.

Computation time is not the only difficulty with using the Hessian. Implementing
the above equations requires the calculation of the total error and its derivatives. But for
continuous function approximation, these are integrals over an infinite number of points.
On each time step, the error, gradient, and curvature can only be calculated for one of these
points.


This was also the case when simply following the gradient, but the problem was
less severe then. If a small step is repeatedly taken in the direction of the gradient
associated with a randomly chosen input, then over time the weight vector will follow a
random walk in the direction of the true gradient. This is effective if the steps taken are
small, and gradually get smaller over time. Now consider calculating the Hessian on each
time step, based only on the derivatives for the current training example. The second
derivatives for one example may be small, even if the sum of them over all the examples is
large. The weight vector would therefore take large steps when it should be taking small
steps.


In the case of a network with only one weight, this problem can be seen
algebraically. The correct step size is the total gradient divided by the total curvature. If the
steps taken are actually the individual slopes divided by individual curvatures, then the
answer is completely wrong:


$$\frac{\sum_i s_i}{\sum_i c_i} \neq \sum_i \frac{s_i}{c_i}$$

where:

s_i = slope (first derivative)
c_i = curvature (second derivative)

The left side is correct. The step size should be the total slope divided by the total
curvature. The right side is incorrect. It is not useful to look at each individual training
point and divide its individual slope by its curvature. In the expression on the left, a small
c_i has almost no effect, whereas on the right side it has a very large effect. When learning
continuous functions, the summations above are actually integrals over infinite sets of
points. If weights are changed after each pass through all the training data, then this
problem does not arise. It is only a problem in incremental training where the weights are
changed after each individual error is found. When learning functions over continuous
input spaces, the Hessian being inverted should actually be the sum of uncountably many
Hessians. If it is simply the sum of the last few Hessians instead, then other problems
arise since it is representing the curvature at the weight vector from several time steps
previous instead of the current weight vector. The more time steps the Hessian averages
over (for more accuracy), the greater the danger that it is no longer meaningful. It is not a
theoretical certainty that second-order methods such as this are more useful for infinite
training sets being trained incrementally, even if the calculations can be done cheaply.
Furthermore, the very nature of self-modifying step sizes may make the network more
susceptible to fixation if the training points are not picked in a perfectly random manner.
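
A two-example calculation makes the size of the discrepancy concrete. The slopes and curvatures below are hypothetical numbers chosen so that one example has a very small curvature.

```python
slopes = [1.0, 1.0]        # hypothetical per-example first derivatives s_i
curvatures = [2.0, 0.01]   # hypothetical per-example second derivatives c_i

# Correct: total slope divided by total curvature.
correct_step = sum(slopes) / sum(curvatures)

# Incorrect: each example's slope divided by its own curvature, then summed.
per_example_steps = sum(s / c for s, c in zip(slopes, curvatures))
```

Here the correct step is just under 1, while the per-example version is about 100.5, almost entirely because of the single example with curvature 0.01.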

4.2 DELTA-BAR-DELTA


Backpropagation has been modified in a number of ways by different researchers as a means of speeding convergence during learning. These modifications are generally compared with Backpropagation on toy problems with small training sets. The Delta-Bar-Delta algorithm, a heuristic method developed by Jacobs [Jac91], is one such attempt at improving the rate of convergence. It has been shown by Jacobs and confirmed in other work performed at Draper Laboratory that this method sometimes allows faster learning than other more common heuristics, on problems involving small training sets. Testing it on the learning problem here allows a more realistic comparison on a more "real world" problem, involving infinite noisy training sets. One of the goals of this thesis is to determine the applicability of methods such as this to learning systems for control.

Delta-Bar-Delta is a heuristic approximation to the effect of using the main diagonal of the Hessian matrix. This main diagonal contains only the second partial derivatives of the error taken twice with respect to each individual weight. Delta-Bar-Delta maintains a local learning rate for each weight, which serves as a heuristic stand-in for the inverse of this second derivative. The equations governing Delta-Bar-Delta [Jac91] for a single weight can be written as:


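\[
\begin{aligned}
w(t+1) &= w(t) - \epsilon(t)\,\delta(t) \\
\delta(t) &= \frac{\partial J(t)}{\partial w(t)} \\
\bar{\delta}(t) &= (1-\theta)\,\delta(t) + \theta\,\bar{\delta}(t-1) \\
\epsilon(t) &=
\begin{cases}
\epsilon(t-1) + \kappa & \text{if } \bar{\delta}(t-1)\,\delta(t) > 0 \\
(1-\phi)\,\epsilon(t-1) & \text{if } \bar{\delta}(t-1)\,\delta(t) < 0 \\
\epsilon(t-1) & \text{if } \bar{\delta}(t-1)\,\delta(t) = 0
\end{cases}
\end{aligned}
\]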
Where:

w(t) = a weight in the network
ε(t) = local learning rate for the weight
δ(t) = the element of the gradient associated with the weight
δ̄(t) = weighted average of recent δ
J(t) = total error in the network (e.g., sum of squared error over all inputs)
θ, φ, κ = constants controlling rate of learning

After each epoch (pass through all the training examples), the partial derivative of error with respect to each weight is calculated and multiplied by the local learning rate, and the weight is changed by that amount. If the current weight vector is in a trough parallel to one of the axes, this can be determined by the fact that the sign of the gradient in one direction keeps changing, while the sign of the gradient in another direction stays the same. In the direction across the trough, the sign of the gradient will therefore often differ from the sign of the average of recent gradients. Once this is noticed, the local learning rate in the direction of the changing sign is decreased, and the rate in the direction of the constant sign is increased. This has the effect of slowing down wasteful movement across the trough, and speeding up movement along the trough. If the trough is aligned at a 45 degree angle to all the axes instead of parallel to one, then the signs of all the gradients will be constantly changing, and the weight vector takes small steps in the direction indicated by Backpropagation. This is unfortunate, but to compensate for this would require additional storage and computation time proportional to the square of the number of weights.

To see whether the sign of the gradient is changing, Delta-Bar-Delta keeps track of two things: the current gradient and an exponentially weighted sum of recent gradients. If these two have the same sign, then the local learning rate is increased, otherwise it is decreased. There is one final heuristic: when the local learning rate is raised, it is increased linearly by adding a constant on each time step. When it is lowered, it is decreased exponentially by dividing it on each time step by a constant. Thus the learning rate falls more quickly than it rises, and so when the nature of the error surface changes often, the weights will tend to change too slowly rather than too quickly, and previously learned information will be in less danger of being erased by momentarily large learning rates. The exponential decreasing also has the advantage of preventing a local learning rate from ever becoming zero or going negative, either of which would prevent correct operation of the algorithm.
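The update rules described above can be sketched for a single weight as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name, the argument order, and the default constants (kappa, phi, theta) are assumptions, and the sign test uses the average of earlier gradients against the current gradient, following Jacobs' formulation.

```python
def delta_bar_delta_step(w, eps, dbar, grad, kappa=0.01, phi=0.1, theta=0.7):
    """One Delta-Bar-Delta update for a single weight.

    w    -- current weight
    eps  -- local learning rate carried over from the previous step
    dbar -- exponentially weighted average of earlier gradients
    grad -- current gradient element dJ/dw
    Returns the updated (w, eps, dbar).
    """
    agreement = dbar * grad
    if agreement > 0:
        eps = eps + kappa            # same sign: raise the rate linearly
    elif agreement < 0:
        eps = (1.0 - phi) * eps      # sign change: shrink the rate geometrically
    w = w - eps * grad               # gradient step scaled by the local rate
    dbar = (1.0 - theta) * grad + theta * dbar
    return w, eps, dbar


# Hypothetical usage on J(w) = w**2 / 2, whose gradient is simply w.
w, eps, dbar = 1.0, 0.1, 0.0
for _ in range(2):
    w, eps, dbar = delta_bar_delta_step(w, eps, dbar, grad=w)
```

Because a decrease multiplies the rate by (1 - phi) with 0 < phi < 1, the local rate stays positive no matter how many sign changes occur, which is the property noted in the paragraph above.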
5  EXPERIMENTS


In the experiments presented here, a number of different combinations of hybrid control system components are tested. Two variations of an indirect adaptive controller are used, both based on Time Delay Control [YI90]. Either the reduced canonical form of the plant is used, causing all the interesting dynamics to be compressed into a single scalar (described below in Section 5.1), or the full state vector form is used. The learning component can learn initially unmodeled dynamics as a function of state alone, or as a function of state and control action. When it is a function of control action, then the partial derivative of the learned unmodeled dynamics with respect to control action is calculated, giving an improved estimate of the effect of control on state. Finally, the learning system can be constrained to learn only functions whose partial derivatives with respect to control action are constant (i.e., the control enters the governing dynamical equations linearly).

These various hybrid controllers are then compared relative to the problem of controlling a simulated plant having both spatial dependencies and noise. The controller should learn to control the plant in the presence of spatial dependencies wherever they occur. As the plant moves from one state to another, the unmodeled nonlinearities may appear in different ways. First, they might apply briefly in the middle of the transition from one region of state-space to another. If the effect is short-lived, then it will have a minimal impact on the trajectory of the plant. Also, once the plant leaves the region where the nonlinearity has an effect, it will have time to recover and move back towards the desired trajectory.

A more severe problem occurs if the nonlinearity appears and then remains present even after the state of the plant reaches the desired value. In this case, the nonlinearity has more time to affect the trajectory, and the plant may never leave its influence long enough to recover without adaptive or learning augmentation. If the nonlinearity is present throughout the plant's trajectory, then the problem is even more difficult. All three scenarios are considered in the experiments below.

Finally,  the  accuracy  of  the  final  controller  is  not  the  only  issue  to  be  considered. 
Since  it  is  a  learning  system,  it  is  also  important  to  consider  how  fast  it  can  learn,  and  how 
susceptible  it  is  to  forgetting  one  region  while  exploring  another.  These  issues  are 
examined  by  the  experiments  in  the  last  section,  below. 

This chapter first describes the plant used for the simulations. The linear dynamical systems matrices are then derived for that plant, and the experimental results are presented for the hybrid system in various configurations. Finally, the Delta-Bar-Delta algorithm is compared with the standard Backpropagation algorithm, and then a modified Delta-Bar-Delta is examined.

All of the experiments below were based on a cart-pole plant being simulated at 50 Hz (using Euler integration), and a controller running at 10 Hz. The cart-pole system is shown in figures 5.1 and 5.2. The a priori knowledge of the plant was limited to a linearized model of the system on the flat regions of the track. The 30 degree tilt in the region between 1 and 2 meters was completely unmodeled and had to be either adapted to or learned.
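The two rates mentioned above imply a simple multi-rate loop: the plant is advanced with Euler steps at 50 Hz, while the control value is recomputed only on every fifth step and held in between. The sketch below shows one way to arrange this; the function and constant names are hypothetical, and the double-integrator plant in the usage lines is a stand-in chosen for brevity, not the cart-pole model.

```python
SIM_HZ = 50                                  # plant integration rate
CTRL_HZ = 10                                 # controller update rate
STEPS_PER_CONTROL = SIM_HZ // CTRL_HZ        # 5 Euler steps per control update
DT = 1.0 / SIM_HZ


def simulate(deriv, controller, state, seconds):
    """Euler-integrate `deriv` at 50 Hz, holding each control value for 5 steps."""
    u = 0.0
    for step in range(int(round(seconds * SIM_HZ))):
        if step % STEPS_PER_CONTROL == 0:
            u = controller(state)            # the controller runs at 10 Hz
        dx = deriv(state, u)
        state = [s + DT * d for s, d in zip(state, dx)]   # one Euler step
    return state


# Hypothetical usage: a double integrator driven by a constant unit command.
final_state = simulate(lambda s, u: [s[1], u], lambda s: 1.0, [0.0, 0.0], 1.0)
```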

Unless otherwise noted, the learning system in all the experiments below was a Backpropagation, sigmoid, two layer network, with 10 nodes in each layer. Connections were made from the inputs to the first layer, from the first layer to the second, and from the second to the outputs. There were also connections from the first layer to the outputs. The inputs consisted of the four elements of state: cart position x, pole angle θ, cart velocity ẋ, and pole angular velocity θ̇. The network was trained using the unmodeled dynamics calculated by the adaptive TDC controller, while moving the cart to a new random position in the range 0 to 3 meters every 4 seconds. In the case of the reduced canonical form of the controller, the training was based on moving the cart from 0 to 3 meters and back, every 4 seconds.


5.1  THE CART-POLE SYSTEM


The plant used for the simulations is based on a standard inverted pendulum system. The problem is to move the cart to some desired track position by applying force directly to the cart center of mass, while at the same time balancing a pole that is attached to the cart via a hinge (see figures 5.1 and 5.2).


The  design  of  an  effective  automatic  control  system  for  the  cart-pole  object  on  the 
split-level  track  is  a  challenging  problem.  The  dynamical  behavior  of  the  nominal  cart-pole 
system  has  the  following  attributes: 

•  nonlinear 

•  open-loop  unstable 

•  nonminimum phase

•  4 state variables: (x, θ, ẋ, θ̇)
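The equations of motion for this plant are:

\[
(m_c + m_p)\,\ddot{x}\sec\alpha + m_p l\,\ddot{\theta}\cos(\theta-\alpha) - m_p l\,\dot{\theta}^{2}\sin(\theta-\alpha) - (m_c + m_p)\,g\sin\alpha = f - \mu_c\,\mathrm{sgn}\,\dot{x}
\]

\[
\tfrac{4}{3}\,m_p l^{2}\,\ddot{\theta} + m_p l\,\ddot{x}\sec\alpha\cos(\theta-\alpha) - m_p g l\sin\theta = -\mu_p\,\dot{\theta}
\]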

where:

    x      =                     position of the cart (m)
    θ      =                     pole angle (rad)
    α      =   π/6 rad           track incline angle
    g      =   9.8 m/s²          acceleration due to gravity
    m_c    =   1.0 kg            mass of cart
    m_p    =   0.1 kg            mass of pole
    l      =   0.5 m             pole half-length
    μ_c    =   0.0005 N          friction between cart and track
    μ_p    =   0.000002 N·m·s    friction between pole and cart
    |f|    ≤   10.0 N            force applied to cart

When  the  track  angle  is  zero  (horizontal  track),  both  the  equations  of  motion  and  the 
plant  parameters  are  identical  to  those  in  [BB90]  and  [BSA83].  To  test  the  learning  ability 
of  the  system,  one  portion  of  the  track  is  set  on  an  incline,  as  shown  in  figure  5.2. 
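For the horizontal-track case (track angle of zero), the two equations of motion are linear in the cart and pole accelerations, so they can be solved as a 2-by-2 system at each time step. The sketch below does this using the parameter values listed above. It assumes the standard cart-pole form from the cited literature, including a pole-friction term equal to minus the friction coefficient times the pole angular velocity; the function and constant names are hypothetical, and the inclined-track terms are deliberately left out.

```python
import math

M_CART, M_POLE, HALF_LEN = 1.0, 0.1, 0.5          # kg, kg, m
MU_CART, MU_POLE, GRAVITY = 0.0005, 0.000002, 9.8


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def accelerations(theta, x_dot, theta_dot, force):
    """Solve the horizontal-track equations of motion for (x_ddot, theta_ddot)."""
    # Linear system: [a11 a12; a21 a22] [x_ddot; theta_ddot] = [r1; r2]
    a11 = M_CART + M_POLE
    a12 = M_POLE * HALF_LEN * math.cos(theta)
    a21 = a12
    a22 = (4.0 / 3.0) * M_POLE * HALF_LEN ** 2
    r1 = (force - MU_CART * sign(x_dot)
          + M_POLE * HALF_LEN * theta_dot ** 2 * math.sin(theta))
    r2 = M_POLE * GRAVITY * HALF_LEN * math.sin(theta) - MU_POLE * theta_dot
    det = a11 * a22 - a12 * a21
    x_ddot = (r1 * a22 - a12 * r2) / det          # Cramer's rule
    theta_ddot = (a11 * r2 - a21 * r1) / det
    return x_ddot, theta_ddot


# At rest with the pole upright, a unit force gives the input gains of the linear model.
x_ddot, theta_ddot = accelerations(0.0, 0.0, 0.0, 1.0)
```

At the upright rest state the two accelerations per unit force come out near 0.9756 and -1.4634, which agrees with the nonzero entries of the B matrix of the linearized model used in this chapter.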
Figure 5.2  Cart-pole system


From the origin to 1 m, the track is level. From 1 m to 2 m, the track slopes down at a 30 degree angle towards the 2 m mark. From 2 m to 3 m the track is level again. The controller is given no a priori knowledge of the inclination of the track. It must adapt every time it reaches the incline, unless it eventually learns to anticipate it.

TDC  allows  a  priori  knowledge  to  be  incorporated  into  the  controller.  Here,  the 
a  priori  knowledge  is  a  model  formed  by  linearizing  the  actual  plant  equations  about  the 
origin, on the flat part of the track. Assuming small pole angles (θ ≪ 1) and a horizontal track (α = 0), the equations of motion may be linearized, and the Laplace transform of them taken to yield a simple transfer function between force and cart position:

\[
\frac{X(s)}{F(s)} = \frac{0.9756\,(s - 3.8360)(s + 3.8360)}{s^{2}\,(s - 3.9739)(s + 3.9739)}
\]

The open-loop poles and zeros (the values of s where the above function is infinite or zero, respectively) are shown in Figure 5.3. The pole in the right-half plane causes the system to be unstable: when left to itself, the pole on the cart generally falls. The zero in the right-half plane causes it to be nonminimum phase: thus to move the cart to the right when the pole is vertical, it is first necessary to move it a small amount to the left.
Figure 5.3  Open-loop poles and zeros in the complex plane


This  linearized  model  is  incorrect  both  in  the  tilted  region  of  track  and  when  the  pole 
angle  is  large. 
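The pole and zero locations can be cross-checked against the linearized state-space model used in this chapter. The sketch below assumes the linear model has the form x_ddot = A32·θ + B3·f and θ_ddot = A42·θ + B4·f, with the four coefficients taken from the A and B matrices quoted in this chapter; the variable names are hypothetical.

```python
import math

# Nonzero coefficients of the linearized model (values quoted in this chapter):
#   x_ddot     = A32 * theta + B3 * f
#   theta_ddot = A42 * theta + B4 * f
A32, A42 = -0.7178, 15.7917
B3, B4 = 0.9756, -1.4634

# Eliminating theta gives
#   X(s)/F(s) = (B3*s^2 + A32*B4 - A42*B3) / (s^2 * (s^2 - A42)).
gain = B3
zero = math.sqrt((A42 * B3 - A32 * B4) / B3)   # zeros at +zero and -zero
pole = math.sqrt(A42)                          # poles at +pole, -pole, and a double pole at 0
print(f"gain {gain}, zeros +/-{zero:.4f}, poles 0, 0, +/-{pole:.4f}")
```

The right-half-plane zero lies slightly inside the right-half-plane pole, which is why the two are close together in the pole-zero plot.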

Taking the partial derivatives of the plant equations of motion and evaluating them at the origin yields a linear model of the plant. This model is of the form:

\[
\dot{\mathbf{x}} = A\,\mathbf{x} + B\,u
\]

\[
A = \begin{bmatrix}
0 & 0 & 1.0000 & 0 \\
0 & 0 & 0 & 1.0000 \\
0 & -0.7178 & 0 & 0 \\
0 & 15.7917 & 0 & 0
\end{bmatrix},
\qquad
B = \begin{bmatrix} 0 \\ 0 \\ 0.9756 \\ -1.4634 \end{bmatrix}
\]

where the state vector is \(\mathbf{x} = [\,x \;\; \theta \;\; \dot{x} \;\; \dot{\theta}\,]^{T}\). The A matrix indicates that the cart position and pole angle are the integrals of cart velocity and pole angular velocity respectively, and that the rates of change of the cart and pole velocities are both proportional to the pole angle. The B matrix indicates that the force applied to the plant directly affects the rates of change of the cart and pole velocities. It is often more convenient to undertake a change of variables in the above equation to put it into controller canonical form. This form is found by first taking the original equations:

\[
\dot{\mathbf{x}} = A\,\mathbf{x} + B\,u, \qquad y = C\,\mathbf{x}
\]

where y is the output being controlled. C could be the identity matrix, but for the plant being controlled here, C is the row vector (1 0 0 0). A change of variables is then introduced by substituting T^{-1}x for x and rearranging the first equation to get:

\[
\dot{\mathbf{x}} = T A T^{-1}\,\mathbf{x} + T B\,u, \qquad y = C\,T^{-1}\,\mathbf{x}
\]

These new equations are then treated as a new plant, with the "A" matrix of the new plant being TAT^{-1} and the "B" matrix of the new plant TB. The new plant is input/output equivalent to the original one, since varying u will have the same effect on y as in the original plant. The purpose of this change of variables is to convert the "A" and "B" matrices to the more convenient form:


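\[
\dot{\mathbf{x}} = A_c\,\mathbf{x} + B_c\,u, \qquad y = C_c\,\mathbf{x}
\]

\[
A_c = \begin{bmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 15.7917 & 0
\end{bmatrix},
\qquad
B_c = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix},
\qquad
C_c = \begin{bmatrix} -14.3561 & 0 & 0.9756 & 0 \end{bmatrix}
\]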


This controller canonical form has three important properties: the Bc matrix is all zeros with a 1 at the bottom, the Ac matrix without its first column and last row is the identity matrix, and the first column of the Ac matrix is all zeros except possibly for the bottom position. The bottom row of the matrix could have been anything, and it would still have been in canonical form. For the x vector in this new canonical form, the first element is the integral of the second, the second element is the integral of the third, the third element is the integral of the fourth, and the derivative of the last element is a linear function of all elements. The control action u only directly affects the fourth element of the state. This form is convenient because a reduced version of the TDC control law can be developed from it that involves scalar and vector algebra and vector inner products instead of full vector-matrix algebra. In particular, the matrix inversion (or pseudo-inversion) required by the original control law reduces to a simple scalar division operation. In addition, the learning system will only have to learn a scalar output instead of a four element vector.
The only complicated part of the above transformation was choosing T^{-1}. This can be done in MATLAB™ with the following code:

    d = poly(A)                                       % characteristic polynomial of A
    Tinv = ctrb(A,B) * hankel(d(length(d)-1:-1:1))    % inverse of the transformation T

where, if A is n by n, then d is a row vector with n+1 elements. The function poly(A) returns the coefficients of the characteristic equation of A, which is the polynomial formed by the determinant of (λI - A), ordered from the highest power to the lowest. The expression "d(length(d)-1:-1:1)" removes the last element of d, the lowest order coefficient, and reverses the remaining elements. The function hankel returns an n by n matrix that has its first column equal to this list and all zeros below the first anti-diagonal. Each element of the matrix equals the element one below and to the left of it. Finally, ctrb returns the n by n controllability test matrix (a row of column vectors) formed from the n by n matrix A and the n by 1 vector B by:

    ctrb(A,B) = [B  AB  A^2B  A^3B  ...  A^(n-1)B]
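The same construction can be checked outside MATLAB. The sketch below re-implements the three helper functions in plain Python, under the assumption that they behave as described above, and then confirms that the resulting T^{-1} satisfies A·T^{-1} = T^{-1}·Ac, that its last column equals B, and that its first row reproduces Cc. The helper names mirror the MATLAB ones but are independent re-implementations; the characteristic polynomial is obtained with the Faddeev-LeVerrier recursion, which is a substitute technique and not necessarily what MATLAB uses internally.

```python
def matmul(P, Q):
    """Plain nested-list matrix product."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]


def char_poly(A):
    """Characteristic polynomial coefficients, highest power first (Faddeev-LeVerrier)."""
    n = len(A)
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    coeffs, M = [1.0], identity
    for k in range(1, n + 1):
        AM = matmul(A, M)
        c = -sum(AM[i][i] for i in range(n)) / k
        coeffs.append(c)
        M = [[AM[i][j] + c * identity[i][j] for j in range(n)] for i in range(n)]
    return coeffs


def ctrb(A, B):
    """Controllability matrix [B, AB, ..., A^(n-1) B] for a single input."""
    cols, v = [], [[b] for b in B]
    for _ in range(len(A)):
        cols.append([row[0] for row in v])
        v = matmul(A, v)
    return [[col[i] for col in cols] for i in range(len(A))]


def hankel(c):
    """Hankel matrix with first column c and zeros below the first anti-diagonal."""
    n = len(c)
    return [[c[i + j] if i + j < n else 0.0 for j in range(n)] for i in range(n)]


A = [[0, 0, 1, 0], [0, 0, 0, 1], [0, -0.7178, 0, 0], [0, 15.7917, 0, 0]]
B = [0, 0, 0.9756, -1.4634]

d = char_poly(A)                                   # [1, a3, a2, a1, a0]
Tinv = matmul(ctrb(A, B), hankel(d[-2::-1]))       # drop a0, reverse the rest
Ac = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [-d[4], -d[3], -d[2], -d[1]]]
Cc = Tinv[0]                                       # C = (1 0 0 0) selects the first row
```

Checking A·T^{-1} = T^{-1}·Ac avoids an explicit matrix inverse while still confirming that T A T^{-1} equals Ac.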


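In discrete-time, the full system is approximated by:

\[
\mathbf{x}_{k+1} = \Phi\,\mathbf{x}_k + \Gamma\,u_k
\]

At 50 Hz:

\[
\Phi_{50} = \begin{bmatrix}
1.000 & -0.0001 & 0.02 & 0 \\
0 & 1.0032 & 0 & 0.0200 \\
0 & -0.0144 & 1 & -0.0001 \\
0 & 0.3162 & 0 & 1.0032
\end{bmatrix},
\qquad
\Gamma_{50} = \begin{bmatrix} 0.0002 \\ -0.0003 \\ 0.0195 \\ -0.0293 \end{bmatrix}
\]

At 10 Hz:

\[
\Phi_{10} = \begin{bmatrix}
1.000 & -0.0036 & 0.1000 & -0.0001 \\
0 & 1.0800 & 0 & 0.1027 \\
0 & -0.0737 & 1.0000 & -0.0036 \\
0 & 1.6211 & 0 & 1.0800
\end{bmatrix},
\qquad
\Gamma_{10} = \begin{bmatrix} 0.0049 \\ -0.0074 \\ 0.0977 \\ -0.1502 \end{bmatrix}
\]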
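The discrete-time matrices quoted above are consistent with a zero-order-hold discretization of the continuous linear model, in which the force is held constant over each sample period. The sketch below reproduces them with a truncated power series. The function names and the number of series terms are assumptions made for illustration, and the source does not state which numerical procedure was actually used.

```python
def matmul(P, Q):
    """Plain nested-list matrix product."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]


def discretize(A, B, dt, terms=20):
    """Zero-order-hold discretization by truncated power series.

    Phi   = I + A dt + (A dt)^2 / 2! + ...
    Gamma = (I dt + A dt^2 / 2! + A^2 dt^3 / 3! + ...) B
    """
    n = len(A)
    term = [[float(i == j) for j in range(n)] for i in range(n)]    # (A dt)^k / k!
    phi = [row[:] for row in term]
    integral = [[dt * term[i][j] for j in range(n)] for i in range(n)]
    for k in range(1, terms):
        term = [[v * dt / k for v in row] for row in matmul(term, A)]
        for i in range(n):
            for j in range(n):
                phi[i][j] += term[i][j]
                integral[i][j] += term[i][j] * dt / (k + 1)
    gamma = [sum(integral[i][j] * B[j] for j in range(n)) for i in range(n)]
    return phi, gamma


A = [[0, 0, 1, 0], [0, 0, 0, 1], [0, -0.7178, 0, 0], [0, 15.7917, 0, 0]]
B = [0, 0, 0.9756, -1.4634]

phi50, gamma50 = discretize(A, B, 1.0 / 50.0)   # plant simulation rate
phi10, gamma10 = discretize(A, B, 1.0 / 10.0)   # controller rate
```

A simple forward-Euler approximation (I + A·dt) would leave the diagonal entries at exactly 1, so entries such as 1.0032 and 1.0800 indicate that higher-order terms were included.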


The  behavior  of  the  reference  model,  in  discrete  time,  is  given  by: 
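\[
\mathbf{x}_{k+1} = \Phi_M\,\mathbf{x}_k + \Gamma_M\,u_k
\]

At 50 Hz:

\[
\Phi_{M50} = \begin{bmatrix}
1.0002 & 0.0048 & 0.0204 & \text{(illegible)} \\
-0.0003 & 0.9958 & -0.0005 & \text{(illegible)} \\
0.0184 & 0.4712 & 1.0343 & 0.1201 \\
-0.0276 & -0.4129 & -0.0515 & 0.8226
\end{bmatrix},
\qquad
\Gamma_{M50} = \begin{bmatrix} -0.0002 \\ 0.0003 \\ -0.0184 \\ 0.0276 \end{bmatrix}
\]

At 10 Hz:

\[
\Phi_{M10} = \begin{bmatrix}
1.0038 & 0.1077 & 0.1072 & 0.0276 \\
-0.0058 & 0.9108 & -0.0109 & 0.0606 \\
0.0654 & 2.0162 & 1.1249 & 0.5176 \\
-0.1012 & -1.5989 & -0.1931 & 0.2770
\end{bmatrix},
\qquad
\Gamma_{M10} = \begin{bmatrix} -0.0038 \\ 0.0058 \\ -0.0654 \\ 0.1012 \end{bmatrix}
\]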

The error feedback gain K is zero. This means that, given the plant's state at time k, the desired state for the plant at time k+1 will always be equal to the state that the reference model would have at time k+1 if it started at the state where the plant is at time k. In other words, for a given commanded state, there will be a set of almost parallel trajectories through state-space, which are the paths that the reference model would take.

At any given point in time, the desired dynamics for the plant is simply to follow the reference trajectory that it is currently on. If K were greater than zero, then the desired dynamics of the plant could be faster than the dynamics of the reference model. The controller would then have to maintain a reference model internally. On the first time step, the state of the reference would be set to the state of the plant. On each time step thereafter, the reference model would be updated according to the reference dynamics. If the plant state matched the reference state, the desired next state of the plant would be equal to the desired next state of the reference. If the plant ever got off of the reference path, then it would not start following a new reference path, but would instead try to get back on to the original path. This integrating kind of behavior acts to keep small errors in the controller from building up over time. Although the added complexity of a nonzero K is never used in the experiments presented here, it would be easy to add the terms for a nonzero K into the hybrid controller. In fact, the equations derived above explicitly contain the terms for K, even though they are never used here.
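With K equal to zero, the desired next state is obtained by applying one step of the reference model to the plant's current state. The sketch below illustrates this at the 10 Hz controller rate. The matrix entries are the reference-model values of this section as best they can be read from the source, the interpretation of the input as the commanded cart position is an assumption, and the function name is hypothetical.

```python
# Reference-model matrices at 10 Hz, as read from the values quoted in this section.
PHI_M10 = [
    [1.0038, 0.1077, 0.1072, 0.0276],
    [-0.0058, 0.9108, -0.0109, 0.0606],
    [0.0654, 2.0162, 1.1249, 0.5176],
    [-0.1012, -1.5989, -0.1931, 0.2770],
]
GAMMA_M10 = [-0.0038, 0.0058, -0.0654, 0.1012]


def desired_next_state(plant_state, command):
    """With K = 0, restart the reference model from the plant's current state."""
    return [sum(p * s for p, s in zip(row, plant_state)) + g * command
            for row, g in zip(PHI_M10, GAMMA_M10)]


# A cart at rest exactly at the commanded position should be asked to stay there.
at_rest = desired_next_state([1.5, 0.0, 0.0, 0.0], 1.5)
```

The input column is the negative of the first column of the state matrix minus the first unit vector, so a plant resting at the commanded position is a fixed point of the reference model.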
5.2  ORGANIZATION OF THE EXPERIMENTS

The cart-pole track is horizontal everywhere except between the points at 1 meter and 2 meters. The 30 degree incline in this region is a large, unmodeled nonlinearity, and so is a more difficult region for the controller unless the learning component is working well. When the cart starts at 0 meters and is told to move to 3 meters, most of the complicated maneuvering and acceleration will be executed near the start and end of the trajectory, both of which are on the well-modeled level part of the track. This trajectory is therefore easier for the adaptive controller than moving from 0.8 to 1.3 meters, where it would have to cross the border of the nonlinearity almost immediately, and would then have to stop on the incline near the edge. The following sections are organized around trajectories of differing difficulty: (i) nonlinearities in the middle, (ii) at the end, or (iii) at the start, middle, and end.

In all experiments, the inclined portion of the track is between the 1 and 2 meter mark. Section 5.3 shows results for the cart moving from 0 meters to 3 meters. Section 5.4 shows results for the trajectory from 0 to 1.3. Section 5.5 shows results when going from 0.8 to 1.3, and also for going from 1.3 to 1.9.

The networks were trained from data generated as the cart was commanded every 4 seconds to move to a new random position between 0 meters and 3 meters. The graphs show the performance of the hybrid over a 9 second period, after learning had already occurred. Two hybrid systems are compared: the reduced hybrid, which learns a scalar version of the unmodeled dynamics associated with the canonical system model described above, and the full hybrid, which learns the vector form of the unmodeled dynamics.

In sections 5.3, 5.4, and 5.5, the full hybrid uses input/output partial derivative information from a network that is constrained to have an output calculated as a general nonlinear function of x, and a constant, linear function of u. This network was used because it was found to give better performance than a network calculating outputs as a general, nonlinear function of both x and u. For the sake of comparison, one run of the general network is shown in section 5.6. There is also one experiment shown in section 5.6 for the case of extremely noisy sensors, which is included to demonstrate that both of the hybrid controllers can continue to work under extremely noisy conditions.

The results are shown throughout this chapter in a consistent format. The position graphs show the position of the reference cart on the track in meters, as well as the position of the cart controlled by the full and reduced hybrid controllers. The other type of graph shows the error in position (reference minus actual) in meters, and the force applied. The force is scaled by a factor of ten, so that the range of the graph corresponds to the full ±10 N range of admissible forces that can be applied to the cart-pole system.

5.3  MID-TRAJECTORY SPATIAL NONLINEARITIES

The first set of experiments was intended to test the ability of the hybrid controllers in the presence of spatial dependencies appearing in the middle of the plant's trajectory. In addition to inherent nonlinearities in the cart-pole system, a further nonlinearity was added by tilting the track 30 degrees in the region from 1 meter to 2 meters. In these first experiments, the cart was commanded to move from its initial position at 0 meters to a final position at 3 meters, while following a desired trajectory through state-space, and without allowing the pole to fall over. Since it spent relatively little time in the inclined region, and since it always left that region before it came close to the final state, this setup introduced mid-trajectory spatial nonlinearities.
Figure 5.4  Plain TDC, from 0 to 3 meters
Figure 5.5  Reduced TDC; force and position error
Figure 5.6  Full TDC; force and position error


Figure 5.4 demonstrates the difficulty of this control task for TDC alone, without learning. Both the reduced and full versions of TDC are able to balance the pole, but they do not follow the desired trajectory very closely. For the reduced version, figure 5.5 shows that there were not very large errors in the cart position until after the actuator started to saturate at -10 N. If it could have applied more than that level of force, it might have done better. The full TDC had equally bad errors, but did not attempt to apply more force than was possible.

Figures  5.7,  5.8,  and  5.9  depict  the  same  experiment,  but  with  the  hybrid 
controller. 
Figure 5.7  Full and reduced hybrid systems, from 0 to 3 meters

With the aid of learning, the hybrid controllers performed extremely well. The reference and actual trajectories were almost completely on top of each other, and appear to be a single curve. The full hybrid is comparable to the reduced version, although the reduced version tracked the reference slightly better. It is interesting that although the full TDC attempted to use less force than the reduced TDC, the full hybrid applied more control action than the reduced hybrid. In fact, the full hybrid tends to oscillate in its application of control action, even though the cart itself did not oscillate visibly.

The experiments in this section demonstrated three things. First, an adaptive controller can be improved significantly when used in a hybrid architecture with a learning system. Second, in some cases, such as the reduced canonical form shown here, simply learning unmodeled dynamics is enough to give acceptable performance. Third, in other cases, such as in the full (noncanonical) form, the performance is not very good unless the input/output partial derivatives of the learned function are also used, and the network itself is modified for this purpose.
5.4  TRAJECTORY-END SPATIAL NONLINEARITIES

The simulations in section 5.3 were all conducted while commanding the cart to move from 0 meters to 3 meters. Since the unexpected tilt in the track was between 1 and 2 meters, the learning system was mainly beneficial during the brief period that the cart was on the incline. Any errors introduced into the state during that period can be handled after the plant has moved on to a region where its a priori model is more accurate. A more difficult problem occurs when the cart is commanded to move from 0 meters to 1.3 meters. Then the spatial dependencies are important at the end of the trajectory, when the cart should be decelerating and settling in on the final state. This section compares the behavior of the reduced and full TDC and hybrid controllers in this more difficult situation.

Figures 5.10, 5.11, and 5.12 show plain TDC trying to move the cart from 0 to 1.3 meters. Both controllers are fine until they reach the edge of the incline at 1 meter. At this point they are trying to decelerate since they are near the goal. The unexpected acceleration causes the pole to fall back, and the cart must then back up past the edge to keep it from falling. This sets up the oscillations around the 1 meter mark which are visible in the figures. The reduced canonical form TDC eventually allows the pole to fall over, while the full TDC eventually reaches the goal, but only after 10 seconds of oscillations. This is exactly the kind of situation for which the integration with the learning system would be expected to be most valuable.

Figures 5.13, 5.14, and 5.15 show the hybrid controllers performing much better on the same problem. Not only is the performance better, but it is accomplished using less force, and saturating less often.
Figure 5.13  Full and reduced hybrid systems; from 0 to 1.3 meters
Figure 5.14  Reduced hybrid, from 0 to 1.3 meters; force and position error
Figure 5.15  Full hybrid, from 0 to 1.3 meters; force and position error


In this problem, the full hybrid follows the reference more closely and overshoots less than the reduced hybrid. It is not surprising that the full controller is better than the reduced one in this case, but not in the case of moving from 0 to 3 meters. When the nonlinearity affected the trajectory only for a brief period of time in the middle, any learning system that could predict that nonlinearity at all could do a good job. However, in the more demanding problem of stopping the cart on the incline near the edge, the exact nature of the nonlinearities on the slope becomes more important. In this case, it is more important to get better estimates of the effect of control on state, by using the partial derivatives of the function that was learned.

5.5  TRAJECTORY-START AND TRAJECTORY-END NONLINEARITIES

It has been shown above that there is a performance improvement created by using partial derivative information in the hybrid. A more difficult control problem arose when the transition in or out of the tilted region occurred near the end of the trajectory, since that is the point at which the cart is starting to slow down and settle in to the correct position. The improvement from using partial derivative information was even more pronounced in this more difficult problem. In this section, a yet more difficult problem is considered, where the cart is commanded to move from 0.8 meters to 1.3 meters. This trajectory is short, so when the cart crosses the boundary of the tilted region, this event is both near the start of the run and near the end of it. Figures 5.16, 5.17 and 5.18 compare the behavior of the reduced and full hybrid controllers.
Figure 5.16 Full and reduced hybrid systems; from 0.8 to 1.3 meters
Figure 5.17 Reduced hybrid, from 0.8 to 1.3 meters; force and position error
Figure 5.18 Full hybrid, from 0.8 to 1.3 meters; force and position error

As  expected,  the  incorporation  of  the  partial  derivative  information  has  a  more 
dramatic effect here than it did in the previous problems. Figure 5.16 shows no overshoot
at  all  for  the  full  hybrid,  as  compared  to  a  large  overshoot  for  the  reduced  hybrid.  As 
before,  the  force  applied  by  the  full  hybrid  was  greater  than  the  force  applied  by  the 
reduced  controller. 
Figure 5.19 Plain TDC, from 0.8 to 1.3 meters
Figure 5.21 Full TDC, from 0.8 to 1.3 meters; force and position error

The performance of the hybrid is more impressive when compared with the result
of plain TDC, as shown in figures 5.19, 5.20, and 5.21. Not only were the oscillations
extreme, but the pole actually fell over after 5 or 6 seconds.

The same experiment was repeated commanding the controller to go from 1.3
meters to 1.9 meters. This ensured that the entire trajectory was on the inclined region of
the track, and so the learning component was very important. The performance of the
hybrid is shown in figures 5.22, 5.23, and 5.24.
Figure 5.22 Full and reduced hybrid; from 1.3 to 1.9 meters
Figure 5.23 Reduced hybrid, from 1.3 to 1.9 meters; force and position error
Figure 5.24 Full hybrid, from 1.3 to 1.9 meters; force and position error


The value of the extra partial derivative information in the full hybrid controller is
exceptionally clear in figure 5.22. The full hybrid gives very acceptable performance,
while the reduced hybrid actually goes into a limit cycle that continues indefinitely. This is
due to the fact that small errors made near the edge of the incline tend to cause the cart to go
across the boundary, thus greatly increasing the errors and inducing further crossings and
further errors. The final results, in figures 5.25, 5.26, and 5.27, are the graphs for the
same experiment with just plain TDC and no learning.
Figure 5.25 Plain TDC, from 1.3 to 1.9 meters
Figure 5.26 Reduced TDC, from 1.3 to 1.9 meters; force and position error


[Plot: full TDC, [1.3, 1.9] (m); control action and tracking error versus time (sec)]
Figure 5.27 Full TDC, from 1.3 to 1.9 meters; force and position error


5.6 NOISE AND NONLINEAR FUNCTIONS OF CONTROL

The preceding three sections showed systematic testing of the two best hybrid
architectures found. This section, for the sake of comparison, shows one run with a worse
hybrid architecture, and one run with the best architectures in an unreasonably noisy
environment.

Figures 5.28, 5.29, and 5.30 show the results for the hybrid controllers in an
unreasonably noisy environment. On each time step, zero-mean Gaussian noise was
added to each sensor reading. For each element of the state vector, the noise had a variance
equal to 10% of the total range that the element normally varies over while following that
trajectory. In practice, if an actual system had sensors this noisy, they would be filtered by
a separate algorithm. Nevertheless, it is interesting to note that the hybrid is relatively
insensitive to noise, and that it still performs well.
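
The noise model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `add_sensor_noise`, the array layout (one row per time step, one column per state element), and the literal reading "variance = 10% of the range" are choices made here, not details confirmed by the experiments.

```python
import numpy as np

def add_sensor_noise(trajectory, variance_fraction=0.10, rng=None):
    """Return a noisy copy of a (T, n) array of sensor readings.

    Assumption: the noise variance for each state element equals
    variance_fraction times the range that element covers along
    this trajectory, following the wording of the text literally.
    """
    rng = np.random.default_rng() if rng is None else rng
    trajectory = np.asarray(trajectory, dtype=float)
    # Total range each state element varies over on this trajectory.
    span = trajectory.max(axis=0) - trajectory.min(axis=0)
    # Standard deviation is the square root of the chosen variance.
    std = np.sqrt(variance_fraction * span)
    # Independent zero-mean Gaussian noise on every reading.
    return trajectory + rng.standard_normal(trajectory.shape) * std
```

An element that never moves along the trajectory has zero range and therefore receives no noise under this reading.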
Figure 5.28 Full and reduced hybrid systems with 10% variance noise; cart position plot
Figure 5.29 Reduced hybrid with 10% variance noise; force and position error
Figure 5.30 Full hybrid with 10% variance noise; force and position error

As can be seen from figure 5.28, both hybrid systems did extremely well. The full
hybrid was slightly better than the reduced hybrid, but needed to apply more force. The
performance difference was probably mainly due to the fact that the actuator saturated for a
longer period in the case of the reduced controller, so that it was not able to apply as much
force as it calculated was actually needed.

When  the  algorithm  for  the  full  hybrid  was  first  developed,  the  network  was 
allowed  to  learn  a  general,  nonlinear  function  of  both  x  and  u.  The  results  of  using  such  a 
network are shown in figures 5.31 and 5.32.
Figure 5.31 Full hybrid; general nonlinear function of both x and u; cart position plot
Figure 5.32 Full hybrid; general nonlinear function of both x and u; force, position error


Although the pole never fell, the controller did not follow the reference path very
closely. This controller was actually worse than TDC by itself. The problem arises
because control action is one of the inputs to the network. In general, to find the correct
control action to achieve the desired state, it is necessary to find the inverse (with respect to
u) of the function implemented by the network. This can be a difficult problem.
Nevertheless, because the unmodeled dynamics may often be considered as a fairly linear
function of control, it should be possible to approximate the function in a given state as a
linear function of control action. In other words, taking into account the unmodeled
dynamics associated with the control action on the last time step, and assuming the partial
derivative of Ψ with respect to u has not changed much, it should be possible to calculate
the appropriate u for the current time step. When this idea was implemented, however, it
did not make any significant difference. This may have been because the network actually
learned Ψ as a nonlinear function of u. If a function is almost, but not quite, a line, then
even if the distance between the function and the line is small everywhere, the difference
between their slopes may be large. Learning the nonlinear function and then using its slope
at some point evidently did not give enough new information to help much. A better
approach would be to have the network learn the best linear function of u, and then look at
the partial derivatives of this linear function. Of course, Ψ(x,u) could still be a nonlinear
function of x, and would only be constrained to be a linear function of u. Accordingly, a
network was set up to learn Ψ as a possibly nonlinear function of x and a linear function of
u. In fact, this alternative arrangement was used for the full hybrid experiment shown in
figure 5.7, and can be seen to be significantly better than the hybrid used in figure 5.31.
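
The benefit of constraining the learned function to be linear in u can be illustrated with a scalar sketch. Everything named here is hypothetical: `a` and `b` stand in for the two state-dependent parts a network might learn, and the 10 unit clipping limit is only an illustrative actuator bound. The point is that a model of the form a(x) + b(x)u can be inverted for u in closed form, and b(x) is exactly its partial derivative with respect to u.

```python
import numpy as np

def psi(x, u, a, b):
    """Model that may be nonlinear in state x but is linear in control u."""
    return a(x) + b(x) * u

def solve_control(x, target, a, b, u_max=10.0):
    """Invert psi with respect to u in closed form, then clip to the limit."""
    slope = b(x)                      # exactly d(psi)/du at state x
    u = (target - a(x)) / slope       # assumes slope is not zero
    return float(np.clip(u, -u_max, u_max))

# Hypothetical stand-ins for the two learned sub-functions.
a = lambda x: np.sin(x)
b = lambda x: 1.0 + 0.5 * np.cos(x)   # never zero, so division is safe

u = solve_control(x=0.3, target=1.0, a=a, b=b)
```

With a general nonlinear dependence on u, the same step would need an iterative search, and the local slope could be a poor guide, which is the difficulty the paragraph above describes.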

Sections 5.1 through 5.6 have explored several different approaches to combining
learned information with an adaptive controller. Using input/output partial derivative
information in the control law seemed to be helpful, but only if the network was
constrained to learn functions nonlinear in x and linear in u. Using the reduced canonical
form had the advantage of allowing the network to learn a function with one output instead
of four, and worked well enough that the partial derivatives were not needed. This system
worked better for scenarios with computation delay and actuator dynamics, and worked
equally well in the presence of noise. Overall, the full hybrid using partial derivatives
tended to be the most effective controller, especially when the trajectory of the plant was
largely in the region of greatest unmodeled dynamics.

5.7 COMPARISON OF CONNECTIONIST NETWORKS USED


5.7.1  Sigmoid 

The network used in most of the above experiments was a two-layer sigmoid
network trained with Backpropagation. Each of the inputs and outputs of the network was scaled before
entering and after leaving it, so that each signal would vary over a range of unit width.
Thus, the network would give equal preference to errors in each output. After trying
several different learning rates, it was found that a rate of 0.005 worked best. The
following graph (figure 5.33) shows the learning curve for the network while learning the
function Ψ(x,u), where Ψ was a nonlinear function of both x and u. The network output
Ψ is a four-element vector with one element for each of the four elements of state. The
graph shows the base 10 logarithm of the error in the network's output, as a function of the
training cycle.
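
A minimal sketch of this kind of network is given below, assuming sigmoid units in both layers, a squared-error loss, and one weight update per training point. The class name, the hidden layer size, and the weight initialization are illustrative choices; the text specifies only the two-layer sigmoid structure, the unit-width scaling, and the learning rate of 0.005.

```python
import numpy as np

def scale_unit(v, lo, hi):
    """Map values in [lo, hi] onto a range of unit width, [0, 1]."""
    return (np.asarray(v, dtype=float) - lo) / (hi - lo)

class TwoLayerSigmoidNet:
    def __init__(self, n_in, n_hidden, n_out, rate=0.005, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_out, n_hidden))
        self.b2 = np.zeros(n_out)
        self.rate = rate

    @staticmethod
    def _sig(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, x):
        h = self._sig(self.W1 @ x + self.b1)   # hidden layer
        y = self._sig(self.W2 @ h + self.b2)   # output layer
        return h, y

    def train_step(self, x, target):
        """One online Backpropagation update; returns the squared error."""
        h, y = self.forward(x)
        err = y - target
        d2 = err * y * (1.0 - y)                 # output-layer delta
        d1 = (self.W2.T @ d2) * h * (1.0 - h)    # hidden-layer delta
        self.W2 -= self.rate * np.outer(d2, h)
        self.b2 -= self.rate * d2
        self.W1 -= self.rate * np.outer(d1, x)
        self.b1 -= self.rate * d1
        return float(err @ err)
```

Because the outputs are sigmoid units, targets are expected to be scaled into [0, 1] first, for example with `scale_unit`.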
Semilog Plot of Error During Learning

Even though each point in the curve is the average error in the output over a period
of 400 training points, the curve still appears to be very noisy. This noise tends to cause
the network to forget what it has learned unless the learning rate is fairly low, and so this
noise is probably the reason that a learning rate of 0.005 was the largest rate that converged
to a local minimum. Higher learning rates modified the weights so much on every step that
they changed enough to forget previously learned information. Lower learning rates
caused the network to learn even more slowly than in figure 5.33. The training period
shown in the figure took approximately 63 hours to run on a Macintosh IIfx. Figure 5.34
shows a three-dimensional slice of the six-dimensional surface learned. In the figure, the
three elements of state not shown are held at zero.

The figure shows each element of Ψ as a separate surface, with all heights scaled to
fit in the cube. The horizontal axis is the control action u, and the diagonal axis is the cart
position x. The function is clearly nonlinear and widely varying in both of these
dimensions, although it varies little along the other dimensions that are not shown.

As TDC generated new training points, these were stored in a buffer. The network
was trained with points randomly drawn from this buffer. This was done to ensure that the
network would not have problems with receiving a long string of training points all from
the same region, thereby causing it to forget what it had already learned in other regions.
Despite this random buffer, the network still learned extremely slowly.
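
The buffering scheme can be sketched as follows. The fixed capacity and the policy of overwriting the oldest point once the buffer is full are assumptions made for this illustration; the text states only that points were stored in a buffer and drawn from it at random.

```python
import random

class TrainingBuffer:
    """Stores (input, target) training points and returns them in random order."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.points = []
        self.rng = random.Random(seed)
        self._next = 0  # slot to overwrite once the buffer is full

    def store(self, x, target):
        item = (x, target)
        if len(self.points) < self.capacity:
            self.points.append(item)
        else:
            # Assumed policy: overwrite the oldest entry.
            self.points[self._next] = item
            self._next = (self._next + 1) % self.capacity

    def draw(self):
        """Return one stored point chosen uniformly at random."""
        return self.rng.choice(self.points)
```

Drawing uniformly from the buffer breaks up long runs of points from a single region of the state space, which is the purpose described above.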

A controller based on this approach would need one of three things to be practical.
First, it could have special hardware to speed up the learning. Second, it might be in a
situation where long learning times are acceptable. If a factory robot can learn to adjust to
normal wear within a few days, then it should be able to learn any unmodeled dynamics
due to wear faster than they occur. Third, the algorithm in the network might be modified
to allow faster learning. This third approach was taken here.

5.7.2 Sigmoid With a Second-Order Method (Delta-Bar-Delta)

One attempt to speed learning was to apply a pseudo-Newton method to the
sigmoid network. Delta-Bar-Delta [Jac91] was chosen because it requires very little extra
computation and it has been compared favorably with a number of other methods.
Unfortunately, comparisons between methods to speed learning are often done with
benchmark problems that do not represent the problem here. People often compare
learning speeds for learning an XOR function or a multiplexor function. These can be
difficult problems for a network to learn, but the network has the advantage that the set of
training points is finite and small, so it is not unreasonable to change weights only after
each cycle through all the training data. Learning a function defined over a real vector is
more difficult, since there is an infinite set of training points. The functions encountered in
this thesis tended to be smooth and have few wrinkles, which meant that there was a large
amount of redundancy in the data that the learning algorithm should have been able to
exploit. These factors combined to yield a problem that was slow for Backpropagation
alone to learn, but should have been learnable quickly by other learning methods.

When Delta-Bar-Delta was first applied, it immediately set all of the local learning
rates to zero, causing the weights to freeze. This was because it worked by comparing the
current partial derivative of error (with respect to a given weight) with an exponential
average of recent values of this derivative. Since this was being done after every training
point, it saw the noise in the training data and interpreted that as rapidly changing signs in
the error derivatives. It responded to that by repeatedly decreasing all of the local learning
rates.

This problem arose because Delta-Bar-Delta was not being used in the epoch
training mode for which it had been designed. The apparent solution was to calculate two
exponentially smoothed averages of the error partials. If these two averages had different
time constants, then comparing them would be like comparing the current true derivative
with a slightly older true derivative.

The values for these time constants were chosen heuristically. Looking at the
learning curve for normal Backpropagation showed that the errors were noisy, but in a 500
training point period a "representative sample" of training points was probably being seen.
The short-term average was therefore chosen so that 80% of the average was determined by
the last 500 training points. The long-term average was then chosen to be 5 times slower,
basing 80% of its value on the last 2500 training points. In normal Delta-Bar-Delta, the
learning rate is increased by a constant every time the current derivative has the same sign
as the long-term derivative average. Since this variation would update learning rates about
500 times more often, the rate of increase for learning rates was set 500 times smaller than
is suggested for normal Delta-Bar-Delta. Similarly, when learning rates are decreased, the
decrease is implemented exponentially by dividing by a constant each time. Since the
modified Delta-Bar-Delta would be expected to divide by this constant 500 times as often,
the 500th root of the suggested constant was used.
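
The parameter choices just described can be made concrete with a short sketch. It is not the implementation used in the experiments: the class and function names, the default values of `kappa` and `divisor`, and the exact sign test between the two traces are assumptions for illustration. What it does take from the text is the 80% rule for the two traces (a decay λ satisfying 1 - λ^N = 0.8), the increment scaled down by 500, and the 500th root of the divisor.

```python
import numpy as np

def decay_for_window(n_points, share=0.8):
    """Decay lam such that the newest n_points carry `share` of an
    exponential average, i.e. 1 - lam**n_points = share."""
    return (1.0 - share) ** (1.0 / n_points)

class TwoTraceDeltaBarDelta:
    def __init__(self, n_weights, rate0=0.005, kappa=0.01, divisor=2.0,
                 window=500, slow_factor=5):
        self.rates = np.full(n_weights, rate0)   # one local rate per weight
        self.short = np.zeros(n_weights)         # short-term derivative trace
        self.long = np.zeros(n_weights)          # long-term derivative trace
        self.lam_short = decay_for_window(window)
        self.lam_long = decay_for_window(window * slow_factor)
        self.kappa = kappa / window               # increment, 500 times smaller
        self.divisor = divisor ** (1.0 / window)  # 500th root of the divisor

    def update(self, weights, grad):
        """Update both traces and the local rates, then take a gradient step."""
        self.short = self.lam_short * self.short + (1 - self.lam_short) * grad
        self.long = self.lam_long * self.long + (1 - self.lam_long) * grad
        agree = self.short * self.long
        # Same sign: raise the rate additively. Opposite sign: shrink it.
        self.rates = np.where(agree > 0, self.rates + self.kappa, self.rates)
        self.rates = np.where(agree < 0, self.rates / self.divisor, self.rates)
        return weights - self.rates * grad
```

For a 500 point window the decay works out to 0.2 raised to the power 1/500, so each new derivative contributes only about 0.3% of the short-term average.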

There are two novel ways that Delta-Bar-Delta can fail. If local learning rates are
increased too often, then they get very large, and weights in the network can start to blow
up. On the other hand, if local learning rates are decreased too often, then they rapidly
approach zero, and the weights freeze. If the local learning rates stay within a reasonable
range, then Delta-Bar-Delta can succeed or fail in the same manner as Backpropagation,
although hopefully it reaches the final state faster.

In experimenting with Delta-Bar-Delta, every run either had exploding weights or
vanishing learning rates. Given the very noisy training data that the network was exposed
to, I was unable to find a useful set of parameters for Delta-Bar-Delta. It is, of course,
possible that such a set of parameters exists. Perhaps Delta-Bar-Delta would work better if
all the local learning rates were normalized on each time step to keep a constant average
value, or perhaps some other heuristic might be applied. It is not immediately clear what
would be the best way to deal with this problem.
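
The normalization heuristic suggested above was not tested in this work, so the following is only a sketch of what it might look like. The function name and the target mean of 0.005 are assumptions, and the rates are assumed to be positive.

```python
import numpy as np

def normalize_rates(rates, target_mean=0.005):
    """Rescale local learning rates so their average equals target_mean.

    The ratios between individual rates are preserved; only the
    overall scale is pinned. Assumes the rates are all positive.
    """
    rates = np.asarray(rates, dtype=float)
    return rates * (target_mean / rates.mean())
```

Applied after every update, this would prevent the rates from all growing or all shrinking together, although individual rates could still drift far from the mean.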

6 CONCLUSIONS AND RECOMMENDATIONS


6.1 SUMMARY AND CONCLUSIONS

This thesis has described a new method for integrating an indirect adaptive
controller with a learning system to form a hybrid controller, incorporating the advantages
of each system. When a learning system is trained with the estimates found by the adaptive
controller, the hybrid system reacts more quickly to unmodeled spatial dependencies in the
plant. This hybrid system follows a reference trajectory better than the adaptive controller
alone, but it can still be improved upon. By using a connectionist system to learn the
function, it is easy to calculate the partial derivatives of that function, which in turn allows
better estimates of unmodeled dynamics, and better estimates of the effect of control action
on state. This modified controller performed better than either the adaptive controller alone
or the original hybrid system.

The feedforward, sigmoid learning system was able to learn the required functions
accurately, but the learning tended to be slow. The problem of slow convergence is widely
recognized and is dealt with by methods such as Delta-Bar-Delta, which accelerate learning
a great deal in published experiments. Unfortunately, those problems used for comparison
usually involve small sets of training examples. The learning problem that arose in this
thesis theoretically required an infinite training set. In practice, Delta-Bar-Delta was found
to be very sensitive to the choice of learning parameters. Even modifying Delta-Bar-Delta
to use two traces instead of one did not solve this problem, and it actually introduced
another parameter that had to be chosen. Therefore, methods for accelerating convergence
on small test problems do not appear to scale as well as commonly thought.

6.2 RECOMMENDATIONS FOR FUTURE WORK


The desired reference trajectory used in these experiments was chosen manually to
give fairly fast response while still being achievable with the 10 Newton force constraint on
the controller. It would be desirable to automate the choice of reference, and this may be
possible. The reference trajectory could start off as a poor controller which is achievable
without using much force. It could then be slowly improved (automatically) until the
actuators nearly saturate, thus finding the best reference that can be matched by this hybrid
controller architecture. The reference could even be a function of state, stored in a separate
connectionist network.

The learning systems used here learned very good approximations, but the learning
tended to be slow. The Delta-Bar-Delta algorithm improves the rate of convergence for
small sets of training points, but was not effective for learning as part of a hybrid control
system, even after being modified. It tended to be too sensitive to the choice of learning
parameters. Learning based on following the first derivative should be faster if accurate
measurements of the second derivatives can be found, so a system such as Delta-Bar-Delta
should be useful if it can automate the choice of parameters, perhaps based on an estimate
of how accurate its second derivative estimates are. Further research should focus on this
problem, perhaps by measuring the standard deviation of the individual measurements to
form an estimate of the accuracy of their average value.
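
One standard way to turn the spread of individual measurements into an accuracy estimate for their average is the standard error of the mean, the sample standard deviation divided by the square root of the number of samples. The sketch below assumes independent measurements, which noisy successive derivative estimates may only approximately satisfy; the function name is illustrative.

```python
import math

def mean_and_standard_error(samples):
    """Return the sample mean and the standard error of that mean.

    Needs at least two samples. The standard error shrinks as
    1/sqrt(n) when the measurements are independent.
    """
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return mean, math.sqrt(variance / n)
```

A learning-rate rule could, for instance, act on a derivative average only when its magnitude is several standard errors away from zero.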




BIBLIOGRAPHY 


[Ast83] Åström, K., "Theory and Application of Adaptive Control - A Survey,"
Automatica, Vol. 19, No. 5, 1983.

[BB90] Baird, L. and W. Baker, "A Connectionist Learning System for Nonlinear
Control,"  Proceedings,  AIAA  Conference  on  Guidance,  Navigation,  and 
Control,  Portland,  OR,  August,  1990. 

[BF90]  Baker,  W.  and  J.  Farrell,  "Connectionist  Learning  Systems  for  Control," 
Proceedings,  SPIE  OE/Boston  '90,  (invited  paper),  November,  1990. 

[Bar89]  Barto,  A.,  "Connectionist  Learning  for  Control:  An  Overview,"  COINS 
Technical Report 89-89, Department of Computer and Information Science,
University  of  Massachusetts,  Amherst,  September,  1989. 

[BS90]  Barto,  A.,  and  S.  Singh,  "Reinforcement  Learning  and  Dynamic 
Programming,"  Proceedings  of  the  Sixth  Yale  Workshop  on  Adaptive  and 
Learning Systems, New Haven, CT, August, 1990.

[BSA83]  Barto,  A.,  R.  Sutton,  and  C.  Anderson,  "Neuronlike  Adaptive  Elements  That 
Can Solve Difficult Learning Control Problems," IEEE Transactions on
Systems, Man, and Cybernetics, vol. SMC-13, No. 5, September/October
1983.

[BSW89] Barto, A., R. Sutton, and C. Watkins, "Learning and Sequential Decision
Making," COINS Technical Report 89-95, Department of Computer and
Information Science, University of Massachusetts, Amherst, September, 1989.

[D'A88]  D'Azzo,  J.,  Linear  Control  System  Analysis  &  Design:  Conventional  and 
Modern, McGraw-Hill, New York, 1988.

[FGG90] Farrell, J., Goldenthal, W., and K. Govindarajan, "Connectionist Learning
Control  Systems:  Submarine  Heading  Control,"  Proceedings,  29th  IEEE 
Conference  on  Decision  and  Control,  December,  1990. 

[Fu86]  Fu,  K.,  "Learning  Control  Systems  -  Review  and  Outlook,"  IEEE 
Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8,
No.  3,  May,  1986. 

[GF90]  Goldenthal,  W.  and  J.  Farrell,  "Application  of  Neural  Networks  to  Automatic 
Control,"  Proceedings,  AIAA  Conference  on  Guidance,  Navigation,  and 
Control,  August,  1990. 

[HW89] Hornik, K., and M. White, "Multilayer Feedforward Networks are Universal
Approximators," Neural Networks, Vol. 2, 1989.




[Jac91] Jacobs, R., "Increased Rates of Convergence Through Learning Rate
Adaptation,"  Neural  Networks,  1,2,  p.  295-301,  1991. 

[Jam90]  Jameson,  J.,  "A  Neurocomputer  Based  on  Model  Feedback  and  the  Adaptive 
Heuristic  Critic,"  Proceedings  of  the  International  Joint  Conference  on 
Neural  Networks,  1990. 

[Jor88]  Jordan,  M.,  "Supervised  Learning  and  Systems  with  Excess  Degrees  of 
Freedom,"  Technical  Report  COINS  TR  88-27,  Massachusetts  Institute  of 
Technology,  1988. 

[Klo88]  Klopf,  H.,  "A  Neuronal  Model  of  Classical  Conditioning,"  Psychobiology, 
vol.  16  (2),  85-125,  1988. 

[LeC87] LeCun, "Modèles Connexionnistes de l'Apprentissage," Ph.D. Thesis,
Université Pierre et Marie Curie, Paris, 1987.

[MC68]  Michie,  D.,  and  R.  Chambers,  "Boxes:  an  Experiment  in  Adaptive  Control," 
Machine  Intelligence,  vol.  2,  E.  Dale  and  D.  Michie,  Eds.,  Edinburgh, 
Scotland: Oliver and Boyd Ltd., 1968.

[Mil91] Millington, P., "Associative Reinforcement Learning for Optimal Control,"
Master's Thesis, Massachusetts Institute of Technology, 1991.

[MP69] Minsky, M. and S. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.

[NW89] Nguyen, D., and B. Widrow, "The Truck Backer-Upper: An Example of Self-
Learning  in  Neural  Networks,"  Proceedings  of  the  International  Joint 
Conference  on  Neural  Networks,  1989. 

[Pal83]  Palm,  W.,  Modeling,  Analysis,  and  Control  of  Dynamic  Systems,  John  Wiley 
&  Sons,  New  York,  1983. 

[Par82] Parker, D., "Learning Logic," Invention Report, S81-64, File 1, Office of
Technology Licensing, Stanford University, 1982.

[Ros62]  Rosenblatt,  F.,  Principles  of  Neurodynamics,  Spartan  Books,  Washington, 
1962. 

[RZ86] Rumelhart, D. and D. Zipser, "Feature Discovery by Competitive Learning,"
Parallel Distributed Processing: Explorations in the Microstructure of
Cognition, vol. 1, Rumelhart, D., and J. McClelland, ed., MIT Press,
Cambridge, MA, 1986.

[RHW86]  Rumelhart,  D.,  G.  Hinton,  and  R.  Williams,  "Learning  Internal  Representation 
by  Error  Propagation,"  Parallel  Distributed  Processing:  Explorations  in  the 
Microstructure of Cognition, vol. 1, Rumelhart, D., and J. McClelland, ed.,
MIT Press, Cambridge, MA, 1986.

[Sam67] Samuel, A., "Some Studies in Machine Learning Using the Game of Checkers
II - Recent Progress," IBM Journal of Research and Development, 11, 601-
617, 1967.




[Sam59] Samuel, A., "Some Studies in Machine Learning Using the Game of
Checkers," IBM Journal of Research and Development, 3, 210-229, 1959,
reprinted in Computers and Thought, A. Feigenbaum and J. Feldman,
ed., McGraw-Hill, New York, 1959.

[Sim87] Simpson, P., "A Survey of Artificial Neural Systems," Technical Document
1106, Naval Ocean Systems Center, San Diego, CA, 1987.

[Sut90] Sutton, R., "Artificial Intelligence by Approximating Dynamic
Programming," Proceedings of the Sixth Yale Workshop on Adaptive and
Learning Systems, New Haven, CT, August, 1990.

[Sut88] Sutton, R., "Learning to Predict by the Methods of Temporal Differences,"
Machine Learning, Kluwer Academic Publishers, Boston, MA, vol. 3, 9-44,
1988.

[Wat89] Watkins, C., "Learning from Delayed Rewards," Ph.D. thesis, Cambridge
University, Cambridge, England, 1989.

[Wer89] Werbos, P., "Backpropagation and Neurocontrol: A Preview and Prospectus,"
Proceedings of the International Joint Conference on Neural Networks,
Washington, D.C., pp. 209-216, vol. 1, 1989.

[Wer74] Werbos, P., Beyond Regression: New Tools for Prediction and Analysis in
the Behavioral Sciences, Ph.D. Dissertation, Harvard University, 1974.

[Wid89] Widrow, B., "ADALINE and MADALINE," Proceedings of the
International Joint Conference on Neural Networks, 1989.

[WZ89] Williams, J. and D. Zipser, "A Learning Algorithm for Continually Running
Fully Recurrent Neural Networks," Neural Computation, 1, 270-280, 1989.

[WB90] Williams, R. and L. Baird, "A Mathematical Analysis of Actor-Critic
Architectures for Learning Optimal Controls Through Incremental Dynamic
Programming," Proceedings of the Sixth Yale Workshop on Adaptive and
Learning Systems, New Haven, CT, August, 1990.

[Wil88] Williams, R., "Towards a Theory of Reinforcement Learning Connectionist
Systems," Technical Report NU-CCS-88-3, College of Computer Science,
Northeastern University, July, 1988.

[YI90] Youcef-Toumi, K. and O. Ito, "A Time Delay Controller for Systems with
Unknown Dynamics," ASME Journal of Dynamic Systems, Measurement,
and Control, Vol. 112, March, 1990.




ATTACHMENT  2 


Reprint  of: 

Nistler, N. (1992). A Learning Enhanced Flight Control System for High
Performance Aircraft, CSDL Report T-1127, S.M. Thesis, Department of
Aeronautics and Astronautics, M.I.T.




A LEARNING ENHANCED FLIGHT CONTROL SYSTEM FOR HIGH
PERFORMANCE AIRCRAFT


by 

Noel  F.  Nistler 

Submitted  to  the  Department  of  Aeronautics  and  Astronautics 
on May 8, 1992 in partial fulfillment of the requirements for the
Degree  of  Master  of  Science  in  Aeronautics  and  Astronautics 


ABSTRACT 

Numerous approaches to flight control system design have been proposed in an
attempt to govern the complex behavior of high performance aircraft. Gain scheduled
linear control and adaptive control have traditionally been the most widely used
methodologies, but they are not without their limitations. Gain scheduling requires large
amounts of a priori design information and costly manual tuning in conjunction with flight
tests, while still lacking an ability to accommodate unmodeled dynamics and model
uncertainty beyond a limited amount of robustness that can be incorporated into the design.
Adaptive control is suitable for nonlinear systems with unmodeled dynamics, but has
deficiencies in accounting for quasi-static state dependencies. Moreover, inherent time
delays in adaptive control make it difficult to match the performance of a well-designed gain
scheduled controller. An alternative approach that is able to compensate for the
inadequacies experienced with traditional control techniques and to automate the tuning
process is desired.

Recent learning techniques have demonstrated an ability to synthesize multivariable
mappings and are thus able to learn a functional approximation of the initially unknown
state dependent dynamic behavior of the vehicle. By combining a learning component with
an adaptive controller, a new hybrid control system is formed that is able to adapt to
unmodeled dynamics and novel situations, as well as to learn to anticipate quasi-static
state dependencies.

This thesis explores the concept of augmenting an adaptive flight controller with a
learning system. The goal is to examine the extent to which learning can be used to
improve the performance of an adaptive flight control system architecture, as well as to
highlight some of the difficulties introduced by learning augmentation. Performance of the
control system is defined in terms of its ability to control a nonlinear,
three-degree-of-freedom aircraft model reacting to altitude and velocity commands. This
hybrid approach offers potential advantages over conventional techniques in terms of
performance, model uncertainty accommodation, and tuning costs.
1  INTRODUCTION


Numerous approaches to flight control system design have been proposed in an
attempt to govern the behavior of high performance aircraft. This class of aircraft presents
formidable challenges to the designer since by nature their dynamics are nonlinear,
multivariable, and coupled (Etkin (1982)). Moreover, high performance aircraft tend to
exhibit modes with relatively high natural frequencies and minimal damping as compared to
typical aircraft. Gain scheduled linear control and adaptive control appear to be the most
popular methodologies for flight control law design, but they are not without their
limitations. Gain scheduling techniques combine multiple linear control laws to formulate a
nonlinear controller (Lewis & Stevens (1992)). This process requires large amounts of a
priori model information and potentially costly manual tuning, since a separate linear
controller must be designed for each of a selected set of distinct regions of the operating
envelope. In addition to this tedious design approach, gain scheduled controllers lack the
ability to accommodate unmodeled dynamics and model uncertainty beyond a limited
amount of robustness that can be incorporated into the design. Adaptive control is suitable
for nonlinear systems with unmodeled dynamics but has deficiencies in effectively
accounting for quasi-static state dependencies. Moreover, inherent time delays of adaptive
control make it difficult to match the performance of an ideal gain scheduled controller
(Stein (1980)). This thesis presents an alternative approach that compensates for some of
the inadequacies experienced with these traditional control techniques.

By combining an adaptive component with a learning system, an innovative new
hybrid controller is formed that allows each mechanism to focus on the control objective for
which it is best suited. The primary role of the adaptive control component in the hybrid
system is to accommodate unmodeled dynamics (i.e., dynamical behavior that is not
expected, based on the design model). Additionally, the adaptive component has the
auxiliary task of providing estimates of any observed unmodeled state dependent dynamic
behavior to the learning system (i.e., unknown dynamics that are a function of state in
areas of the state space where learning has not occurred). These estimates are obtained by
observing previous plant behavior, essentially providing delayed estimates. Moreover,
since no use is made of past estimates, the adaptive component can be considered to act
without memory. Based on the estimates from the adaptive component, a learning system
can be used to learn a functional approximation of these state dependencies and ultimately
reduce model uncertainty in the system. Connectionist networks (which include artificial
neural networks) have demonstrated the ability to synthesize highly nonlinear, multivariable
mappings (Funahashi (1988), Hornik, et al. (1989)). More specifically, spatially localized
connectionist networks have been proposed as an appropriate learning system for control
applications (Baker & Farrell (1992)). Armed with a mapping from the learning system
that represents the previously unknown state dependencies, the hybrid controller can
anticipate vehicle behavior that is a function of state and compensate accordingly,
effectively removing the delay in the estimates provided by an adaptive controller. The
impact of a controller that has the ability to anticipate vehicle behavior can be seen in
improved closed-loop system performance. Moreover, this ability to learn state
dependencies offers advantages over conventional techniques in terms of model uncertainty
accommodation and automation of the tuning process.


1.1  PROBLEM DESCRIPTION


This thesis presents the development and application of a hybrid control system to
the problem of flight control in a high performance aircraft. Time Delay Control (TDC), a
model reference adaptive controller, is augmented by a linear-Gaussian connectionist
network to form the hybrid flight control system. This hybrid system is applied to the
control of the longitudinal motion of a high performance aircraft during various altitude and
velocity maneuvers. Due to nonlinearities, model uncertainty, unknown dynamics, and a
host of other difficulties, high performance aircraft present a significant challenge to the
development of flight control systems.

1.2  THESIS OBJECTIVES

This thesis explores the use of a learning system to augment an adaptive flight
controller. The extent to which learning can be used to improve an adaptive flight control
system architecture, as well as the difficulties introduced by learning augmentation, are
examined. The primary objective of this thesis is to illustrate the advantages of a hybrid
adaptive / learning control system in terms of its ability to accommodate unmodeled
dynamics and reduce state dependent uncertainties in the system model. This hybrid
approach offers advantages over conventional techniques in terms of performance,
robustness, and design refinement costs.

1.3  OVERVIEW


In Chapter 2, the challenges associated with high performance aircraft control law
design are outlined. Moreover, background information on traditional control techniques is
provided to serve as a foundation for the hybrid control law development, and also as a
basis for comparison of alternative designs. The theoretical concepts underlying
connectionist learning systems, as well as some approaches to using learning systems for
control, are also presented.

In Chapter 3, the technical aspects of the hybrid control law are developed. This is
accomplished by first presenting the underlying theory of the adaptive component and the
spatially localized learning system before moving on to the derivation of the hybrid system.
General characteristics of the hybrid controller are also presented.

In Chapter 4, two experiments are presented to illustrate the implementation and
performance of the hybrid control law. The first experiment uses the hybrid system to
control a relatively simple nonlinear aeroelastic oscillator. Due to the low dimensionality of
the plant, and a known truth model, the analysis and evaluation of the hybrid control
system for the aeroelastic oscillator is greatly simplified. In the second experiment, the
hybrid system is applied to a realistic high performance aircraft model. Descriptions of the
major components of the aircraft model as well as its significant characteristics are also
provided. An evaluation of aircraft performance when controlled by the hybrid system is
presented and compared with other designs for various simulations. Learning system
characteristics are also described.

Chapter 5 summarizes the major contributions of this thesis. In addition,
recommendations for future research are presented.

A  bibliography  of  the  works  used  in  preparing  this  thesis  follows  Chapter  5. 
2  BACKGROUND


The design of automatic flight control systems for high performance aircraft
presents significant challenges for the control engineer. Although well-developed design
methodologies exist for linear systems, similar methodologies and related theories for
nonlinear systems have proven to be elusive. In this chapter, the formidable challenges
inherent in high performance aircraft control system design are presented in Section 2.1,
conventional control approaches for accommodating these difficulties are presented in
Section 2.2, while the fundamentals of connectionist learning systems and some
approaches for learning control are introduced in Section 2.3.

2.1  HIGH PERFORMANCE AIRCRAFT CHARACTERISTICS

Because the aerodynamic forces and moments that act on an aircraft are
complicated, nonlinear functions of many variables, aircraft exhibit complex flight
dynamics. This section discusses the major difficulties associated with high performance
aircraft flight control design.

Due to the high cost and dangers involved in flight testing, the majority of the effort
in flight control system design and development relies on a model of the aircraft instead of
the actual vehicle. This approach guarantees the presence of model uncertainty since it is
impossible to capture the complete dynamical behavior of complex aircraft in a model.
Errors in the model can be attributed to two major factors, structural and parametric
uncertainty (Baker & Farrell (1991)). Typically, the mathematical structure of an aircraft
model is derived from the general equations-of-motion for a single, rigid body. These are
the classical Euler equations. From this base set of equations, the designer determines
additional effects that must be included to obtain an effective flight control system design.
Gyroscopic effects due to the presence of spinning rotors and aeroelastic effects due to
inaccuracies in the rigid body assumption have historically been incorporated into the
equations-of-motion. Beyond the difficulties associated with the selection and development
of the proper model structure, the accuracy of the actual parameter values used in the model
plays a large role in the quality of an aircraft model. Since values of the parameters are
typically obtained from wind-tunnel testing or computational fluid dynamics (e.g.,
computer simulations of airflow over an aircraft model), large discrepancies are possible.
Additional model uncertainty develops from the fact that not all flight conditions can be
easily modeled by a single global model structure. For this reason, separate models are
needed for post-stall flight, vertical take-off modes, and other extreme flight conditions. In
general, all models contain a degree of uncertainty that must be addressed by the flight
control system.

Nonlinearities present a major difficulty to the control engineer since no general
theory for control design synthesis has been developed for nonlinear systems. Aircraft
dynamical behavior is inherently nonlinear; this nonlinear behavior is caused primarily by
the fact that the aerodynamic forces and moments that dictate aircraft motion are
themselves complicated, nonlinear functions of many variables. Moreover, the full
six-degree-of-freedom rigid body equations-of-motion include nonlinear terms. The effects of
actuator rate limiting, control position limits, and other control linkages are further
examples of nonlinearities.

Another complication experienced with flight control law design is that high
performance aircraft are inherently high dimensional, multivariable systems. A
six-degree-of-freedom aircraft requires twelve coupled equations to fully characterize its
rigid body dynamics. Moreover, multiple control effectors (e.g., stabilator, rudder,
ailerons, and throttle) are employed to achieve the primary objective of simultaneously
controlling a number of outputs (e.g., altitude, heading, and velocity). As a result, any
control system that attempts to decouple the dynamics and connect independently designed
single-input / single-output controllers will generally sacrifice performance for ease of
design.

The  "high  performance"  qualifier  on  the  aircraft  model  implies  expanded  flight 
regimes  that  also  tend  to  exacerbate  control  difficulties.  These  regimes  include  high  angle- 
of-attack,  high  Mach,  and  other  regions  of  the  aircraft  envelope  where  large  changes  in  the 
aircraft  dynamics  can  be  expected.  For  example,  a  dynamic  mode  that  is  stable  and 
adequately  damped  in  one  region  of  the  envelope  may  become  lightly  damped  or  unstable  in 
another. This fact, combined with the general trend toward relaxed static stability, requires
rapid  control  action  to  stabilize  the  aircraft. 

The  above  discussion  illustrates  the  major  challenges  in  flight  control  law  design. 
Additional  difficulties  confront  the  control  engineer  due  to  the  design  methods  themselves 
(e.g.,  frequency  domain  methods  do  not  easily  lend  themselves  to  multivariable  control) 
and due to challenges in applying the control approach to the real vehicle (e.g., digital
implementation  issues). 

2.2  TRADITIONAL CONTROL TECHNIQUES

Automatic flight control systems have evolved from the "Sperry Aeroplane
Stabilizer," the first functional autopilot, to advanced multivariable digital systems capable
of generating a large number of control actions per second (Lewis & Stevens (1992)). Of
the multitude of design theories and methodologies developed for flight control law design,
the majority can be classified into two broad categories: fixed control (e.g., robust
control and gain scheduled control) and adaptive control. The following sections introduce
these traditional control approaches. Each technique is critiqued on its ability to
accommodate the design difficulties presented in the previous section.
2.2.1  Robust Control

Robust control has gained popularity for flight control due to its ability to
accommodate a certain degree of uncertainty associated with the aircraft model. By
explicitly incorporating uncertainty into the design process, robust controllers provide
performance and stability guarantees. However, this resilience to uncertainty, or
robustness, is usually obtained at the expense of a loss in system performance. Since
typical robust control techniques (e.g., classical Bode gain / phase margin methods or H∞
design) rely on a worst case estimate of the modeling error or margins to determine a fixed
parameter control system, the resulting control law is often conservative when applied to
the nominal plant and presents a tradeoff between stability robustness and high
performance. Thus, a control system designed to account for modeling uncertainty results
in suboptimal performance relative to the ideal case where no model uncertainty exists. To
increase performance, the designer can exploit an improved model having less uncertainty.
However, the added complexity and cost of a more refined model often prohibits this
course of action. Beyond difficulties in achieving maximum performance, robust
controllers are ill-adapted to handle highly nonlinear systems or unmodeled dynamics. In
particular, although slight perturbations due to nonlinearities or unknown dynamics can be
accommodated by further increasing the bounds on uncertainty, difficulties in achieving
adequate performance are further exacerbated. For highly nonlinear aircraft with
substantial unmodeled dynamics or model uncertainty, robust control is impractical from a
performance point of view.

2.2.2  Gain  Scheduling 

Flight control systems for modern high performance aircraft are generally
developed with a gain scheduling design methodology. Gain scheduling methods combine
multiple linear control laws to formulate a nonlinear controller. This control approach can
accommodate many of the difficulties associated with complex nonlinear systems, such as
high performance aircraft. To formulate this nonlinear control law, the operating envelope
is separated into an ad hoc set of distinct regions where the dynamical behavior is
approximately linear. By linearizing the dynamics in each distinct region, the designer is
able to utilize the large class of linear control theories (e.g., robust or optimal approaches)
to develop a control law best suited to realize local performance objectives. The combined
nonlinear control law is achieved by transitioning among these linear control laws as flight
conditions move among the prescribed linearized regions. Transitioning is accomplished
by interpolating the control parameters (e.g., feedback gains) as a function of scheduling
variables or operating condition. Mach number, angle-of-attack, and dynamic pressure are
the most commonly used scheduling variables. As a result, highly nonlinear systems
require numerous linearized regions, and subsequently a multitude of linear control laws, to
approximate nonlinear behavior.
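
The interpolation of control parameters described above can be illustrated with a
minimal sketch. It assumes a single scheduling variable (Mach number), a single
feedback gain, and linear interpolation between design points; the breakpoints, gain
values, and function names are hypothetical and are not taken from any aircraft design.

```python
from bisect import bisect_right

# Hypothetical schedule: one feedback gain designed at a few Mach numbers.
# The breakpoints and gains are illustrative values, not aircraft data.
MACH_POINTS = [0.3, 0.6, 0.9, 1.2]
GAINS = [2.0, 1.4, 0.9, 0.7]


def scheduled_gain(mach):
    """Linearly interpolate the gain; hold the end values outside the table."""
    if mach <= MACH_POINTS[0]:
        return GAINS[0]
    if mach >= MACH_POINTS[-1]:
        return GAINS[-1]
    i = bisect_right(MACH_POINTS, mach) - 1
    frac = (mach - MACH_POINTS[i]) / (MACH_POINTS[i + 1] - MACH_POINTS[i])
    return GAINS[i] + frac * (GAINS[i + 1] - GAINS[i])


def control(mach, error):
    """Local linear law u = K(mach) * error with an interpolated gain."""
    return scheduled_gain(mach) * error
```

Each table entry stands for one linearized design point, so a finer approximation of
strongly nonlinear behavior requires more entries and therefore more separately
designed linear control laws.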

In  addition  to  the  subjective  (and  tedious)  nature  of  defining  a  set  of  linearized 
operating  regimes  and  designing  a  linear  control  law  for  each  linearization  point,  gain 
scheduled  flight  control  systems  are  also  susceptible  to  model  uncertainty  and  unmodeled 
dynamics.  Differences  between  the  observed  and  predicted  vehicle  behavior  can  only  be 
corrected by on-line manual tuning during flight testing.

2.2.3  Adaptive  Control 

Adaptive control has been suggested as a viable method for aircraft flight control
(Lewis & Stevens (1992), Stein (1980)). Adaptive techniques generally rely on differences
between desired and observed vehicle behavior to adjust (adapt) variable, internal
parameters to ultimately achieve acceptable closed-loop performance. Using this approach,
adaptive controllers have shown an ability to accommodate nonlinear plants with
unmodeled dynamics. However, adaptive controllers encounter difficulties in systems with
rapidly varying parameters and extensive nonlinearity. In an adaptive technique, the
controller must wait until undesired plant behavior is observed before it can determine how
to adjust its parameters. Potentially, several control intervals might be required to
accurately detect and compensate for variations in these parameters. Beyond this delay
associated with determining the correct parameters, sensor noise causes additional delay
due to the required filtering. For vehicles that regularly experience large parameter
variations, the resulting control law may spend large portions of time in some suboptimal,
partially adapted configuration. This dilemma is exacerbated by the reactive nature of
adaptive controllers in that parameters must be re-tuned whenever the vehicle enters a
new region, even if the correct values had previously been determined for that region.
Hence, adaptive controllers fail to make use of predictable behavior (e.g., state
dependencies) that would reduce the time spent in partially adapted states and ultimately
improve performance. For these reasons, it is difficult for an adaptive controller to match
the performance of a well-designed gain scheduled controller.

Although not as common as gain scheduling approaches, multi-region
adaptive controllers have also been suggested as [...] control (Athans, et al. [...],
Stein, et al. [...]). Essentially, such [...] multiple local plant
models [...] an indirect adaptive control framework.

2.3  CONNECTIONIST LEARNING SYSTEMS

Connectionist learning systems have received much attention in the research
community [...] solving problems in pattern recognition, associative
memory, and database retrieval [...]. Moreover, recent attention has been given
to the ability of connectionist networks to synthesize multivariable, nonlinear mappings and
how this information can be used to improve [...] control systems. In this
section, a brief history of the development of [...] classes of connectionist learning
systems relevant to the control problem discussed earlier is presented. Some
alternative approaches to incorporating connectionist systems into control system designs
are also introduced.

2.3.1  Foundations  of  Connectionist  Systems 

Connectionist systems, which include what are often called "artificial neural
networks," owe their foundations to biologists and research psychologists who originally
studied the ability of neural models to mimic the behavior of the brain (Rosenblatt (1962),
Klopf (1988)). Contemporary connectionist systems have advanced significantly from
these early beginnings (Barto, et al. (1983), Rumelhart, et al. (1986)). Many of the recent
connectionist learning systems emphasize the mathematical theory of function
approximation, estimation, and optimization (Baker & Farrell (1992), Poggio & Girosi
(1990)).

Connectionist  learning  systems  typically  contain  a  large  number  of  simple 
processing  units  that  are  combined  in  a  highly  interconnected  architecture.  These 
processing  units,  also  known  as  nodes  or  "artificial  neurons,"  make  up  the  basic  building 
blocks of a connectionist system. Figure 2.1 illustrates the internal structure of a simple,
3-input node.

Figure 2.1  3-Input / 1-Output Simple Node

where x1, x2, and x3 are the node inputs, w1, w2, and w3 are weightings for the respective
inputs, and y is the sum of the weighted inputs. The output of the node, z, is simply
the value of the nodal function f evaluated at y. Nonlinear nodal functions are required to
realize nonlinear mappings. Three examples of nodal functions are the threshold,
sigmoidal, and Gaussian functions.
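
As an illustration of the node computation just described, the following minimal
sketch forms the weighted sum y and applies a nodal function f to obtain z. The
particular forms chosen here (a logistic sigmoid, a unit-width Gaussian centered at
zero, and a threshold that returns -1 when y equals the threshold) are assumptions made
for the example, as are all function and parameter names.

```python
import math


def threshold(y, level=0.0):
    """Binary threshold: +1 above the level, -1 otherwise (assumed tie rule)."""
    return 1.0 if y > level else -1.0


def sigmoid(y):
    """Logistic sigmoid, one common sigmoidal choice."""
    return 1.0 / (1.0 + math.exp(-y))


def gaussian(y, width=1.0):
    """Gaussian bump centered at y = 0 (assumed center and width)."""
    return math.exp(-(y / width) ** 2)


def node_output(inputs, weights, nodal_function):
    """Compute z = f(y), where y is the weighted sum of the inputs."""
    y = sum(w * x for w, x in zip(weights, inputs))
    return nodal_function(y)
```

For example, `node_output([1.0, 2.0, 3.0], [0.5, 0.25, -0.5], sigmoid)` evaluates the
sigmoid at y = -0.5; swapping in `threshold` or `gaussian` changes only the final
nonlinearity, not the weighted sum.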

If a large amount of a priori information is known about the desired mapping of the
network, the weights between the nodes can be set to fixed values to realize the network
mapping. However, typical connectionist networks use nodes with fixed functions and
adaptable weights that are adjusted using an appropriate learning law. Under supervised
learning, the amount of weight adjustment is determined by evaluating an error formed by
the difference between the calculated output of the network and a known desired output
(Melsa (1989)). This contrasts with weight adaptation by unsupervised learning, where
only inputs and a reinforcement signal that characterizes past performance (i.e., not a
known desired output) are utilized in adjusting the weights (Barto (1989), Mendel &
McLaren (1970)). Thus, the operation of adaptable connectionist networks consists of two
distinct phases: output calculation and learning. The output calculation phase is
characterized by the determination of the network output based upon the given inputs,
weights, and nodal functions. The purpose of the learning phase is to adjust the weights
(using either a supervised or unsupervised technique) to obtain desired input / output
behavior.

Connectionist networks are frequently categorized by the nodal architecture and
associated output calculation or by the learning technique. One common architecture
dependent on a specific output calculation method is the feedforward connectionist network
(Funahashi (1988), Hornik, et al. (1989)). In feedforward structures, the output for any
given node is not connected back as an input to itself by any feedback loop. Because of
this feature, present outputs do not impact future output values (present outputs can impact
future outputs in the learning phase by adjustment of the weights). Moreover, the output of
the entire system can be calculated in a single pass since each layer simply outputs
computed values based on inputs from the previous layer. Figure 2.2 illustrates a simple
feedforward network.

Figure 2.2  Simple 2-Input / 1-Output Feedforward Network
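
The single-pass output calculation of a feedforward network can be sketched as
follows. This is an illustrative example only: the layer sizes, the weight values, and
the use of a logistic sigmoid at every node are assumptions, and the network is not
meant to reproduce the one drawn in Figure 2.2.

```python
import math


def layer_output(inputs, weight_rows):
    """Each row of weights feeds one sigmoidal node in the layer."""
    outputs = []
    for row in weight_rows:
        y = sum(w * x for w, x in zip(row, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-y)))
    return outputs


def feedforward(inputs, layers):
    """Single pass: every layer sees only the previous layer's outputs."""
    signal = list(inputs)
    for weight_rows in layers:
        signal = layer_output(signal, weight_rows)
    return signal


# Hypothetical 2-input / 1-output network with two hidden nodes.
LAYERS = [
    [[1.0, -1.0], [0.5, 0.5]],  # hidden layer: 2 nodes, 2 weights each
    [[1.0, 1.0]],               # output layer: 1 node, 2 weights
]
```

Because no node feeds back on itself, calling `feedforward` twice with the same inputs
and weights returns the same output; only a change to the weights during learning can
alter later outputs.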

Another  major  class  of  connectionist  systems  consists  of  feedback  (or  recurrent) 
networks.  The  distinguishing  feature  of  a  feedback  network  is  that  nodes  have  the  ability 
to  influence  themselves  through  feedback.  The  feedback  can  act  directly  from  a  given  node 
to  itself  or  indirectly  through  other  nodes.  Although  feedback  networks  have  an  ability  to 
learn dynamical mappings (e.g., mappings that change with time), the learning laws
become  complicated  since  the  network  output  is  no  longer  simply  a  function  of  network 
inputs  and  weights  (it  is  also  a  function  of  the  state  of  the  network).  Moreover,  any 
feedback  network  representing  a  dynamical  mapping  can  be  expressed  as  an  equivalent 
dynamic  system  of  two  static  mappings  separated  by  an  integration  or  unit  delay  operator 
(Livstone, et al. (1992)).
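
The decomposition of a dynamical mapping into two static mappings separated by a
unit delay can be sketched as below. Both mappings are hypothetical scalar functions
chosen only to keep the arithmetic simple; in a feedback network each would be realized
by a static (feedforward) network.

```python
def state_map(state, u):
    """Static mapping from (state, input) to the next state (hypothetical)."""
    return 0.5 * state + u


def output_map(state, u):
    """Static mapping from (state, input) to the output (hypothetical)."""
    return 2.0 * state + u


def run(inputs, state=0.0):
    """The unit delay holds the state between the two static mappings."""
    outputs = []
    for u in inputs:
        outputs.append(output_map(state, u))
        state = state_map(state, u)  # becomes available one step later
    return outputs
```

With the input sequence `[1.0, 0.0, 0.0]` the outputs are `[1.0, 2.0, 1.0]`: the same
input value 0.0 produces two different outputs because the output also depends on the
stored state, which is the property that complicates the learning laws.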

By altering the nodal function, output calculation, learning approach, or a host of
other variables, connectionist networks have been developed that display an array of
different properties (Barto (1989), Melsa (1989), Minsky & Papert (1969)). Section 2.3.2
discusses some of the most popular early connectionist systems.

2.3.2  Early Connectionist Networks

One of the earliest uses of a connectionist methodology for learning was the
perceptron network (Rosenblatt (1962)). A simple perceptron network is comprised of
single or multiple layers of perceptron nodes connected in a feedforward configuration. A
perceptron node is characterized by the binary threshold function used to formulate the
output from the weighted sum of its inputs as shown in Figure 2.3. If the weighted sum is
greater than some prescribed threshold value, the perceptron node outputs an "on" signal or
the value 1. For inputs below the threshold, the node is considered "off" and outputs -1.


$$
f(y) = \begin{cases} 1 & \text{if } y > \text{threshold} \\ -1 & \text{if } y < \text{threshold} \end{cases}
$$

Figure 2.3  Binary Threshold Function

Perceptron networks have illustrated surprisingly powerful mapping capabilities.
Minsky and Papert demonstrated the ability of single-layer perceptron networks to learn
any discriminant function among classes that are linearly separable, using a simple learning
rule (Minsky & Papert (1969)). The learning rule adjusts the weights incrementally
depending on their impact on the error between the network output and the prescribed
output. It was later shown that multi-layer perceptron networks are capable of
discriminating a large class of nonlinearly separable problems. However, no general
guarantee on the ability of any learning law to locate an optimal set of weights exists for
multi-layer networks as in the single layer case.
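
A minimal sketch of an incremental, error-driven learning rule for a single
perceptron node is given below. It uses the classical perceptron update (weights moved
by a fixed fraction of the output error times the input) with the threshold represented
as a bias weight; this particular form, the learning rate, and the logical AND training
set are assumptions chosen for illustration rather than the rule as stated in the cited
work.

```python
def predict(weights, bias, x):
    """Binary threshold node: +1 if the weighted sum exceeds zero, else -1."""
    y = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if y > 0 else -1


def train_perceptron(samples, rate=0.1, max_epochs=100):
    """Adjust weights only when the output differs from the prescribed output."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + rate * error * xi for w, xi in zip(weights, x)]
                bias += rate * error
        if mistakes == 0:
            break  # every sample is classified correctly
    return weights, bias


# Logical AND with +/-1 targets: a linearly separable problem.
AND_SAMPLES = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
```

Training on `AND_SAMPLES` stops once an entire pass produces no errors. Replacing the
data with a problem that is not linearly separable (e.g., exclusive-or) would leave the
loop running until `max_epochs` without finding such a set of weights.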

Another pioneering connectionist network is the adaptive linear element, or
ADALINE (Widrow & Hoff (1960)). ADALINE networks consist of simple nodes
connected in a feedforward architecture. The distinguishing features of an ADALINE
network include a nodal function that simply outputs the weighted sum of the inputs (i.e.,
f(y) = y) and a normalized least mean square (LMS) learning law. Under supervised
learning, where the current inputs and desired output are known, the LMS learning law
attempts to minimize the mean squared value of the error. When the weights are changed in
proportion to the error, an ADALINE network is guaranteed to converge to the minimum of
the mean squared error for linearly separable problems. In an attempt to extend this result
to nonlinearly separable problems, ADALINES can be connected in a hierarchical structure
to form a network of multiple adaptive linear elements (MADALINES). Although
MADALINES are capable of producing complicated nonlinear mappings, determining the
optimal weights between layers of ADALINES is a difficult process. These difficulties are
the result of LMS learning laws being limited to the determination of optimal ADALINE
weights and not the weights associated with their connecting layers (Melsa (1959)).
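
A normalized LMS update for a single ADALINE can be sketched as shown below. The particular normalization (dividing by the squared length of the input vector), the learning rate, and the training data are assumptions made for illustration; the source does not give the exact form of the learning law.

```python
from typing import List, Sequence, Tuple


def adaline_output(weights: Sequence[float], x: Sequence[float]) -> float:
    """Identity nodal function: the output is the weighted sum, f(y) = y."""
    return sum(w * xi for w, xi in zip(weights, x))


def train_adaline(
    samples: Sequence[Tuple[Sequence[float], float]],
    rate: float = 0.5,   # assumed step size, must lie in (0, 2) for this form
    epochs: int = 200,   # assumed number of passes
) -> List[float]:
    """Normalized LMS: w <- w + rate * error * x / (x . x)."""
    weights = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, desired in samples:
            error = desired - adaline_output(weights, x)
            norm_sq = sum(xi * xi for xi in x)
            if norm_sq == 0.0:
                continue  # a zero input carries no information about the weights
            weights = [w + rate * error * xi / norm_sq
                       for w, xi in zip(weights, x)]
    return weights


# Hypothetical data generated by d = 2*x1 - 1*x2 + 0.5 (third input is a constant 1).
LINEAR_DATA = [((1, 0, 1), 2.5), ((0, 1, 1), -0.5),
               ((-1, 0, 1), -1.5), ((0, -1, 1), 1.5)]
```

For this consistent data set, `train_adaline(LINEAR_DATA)` approaches the generating weights (2, -1, 0.5), because each update changes the weights in proportion to the current error along the current input direction.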

2.3.3 The Backpropagation Network

Although the advent of perceptrons, ADALINES, MADALINES, and their variants
played a large role in the development of connectionist networks, the latest resurgence of
interest in learning systems can be attributed to the backpropagation sigmoidal network.
Although backpropagation is strictly speaking a learning law, its extensive use has resulted
in the name being generalized to denote the large class of feedforward multi-layer networks
that employ this particular learning approach. Similar to the early architectures,
backpropagation networks are constructed from the combination of simple nodes arranged
in a hierarchical, feedforward fashion. However, instead of the threshold and identity
functions associated with the simple perceptron and ADALINE networks respectively, the
backpropagation node uses a different nodal function. One of the most commonly used
nodal functions is the sigmoidal function illustrated in Figure 2.4.

Figure 2.4 Sigmoidal Function

A sigmoidal function is a continuous, monotonically increasing function with finite
asymptotic values. A sigmoid offers an advantage over discontinuous nodal
functions in that it is continuously differentiable, which plays a large role in the gradient
learning algorithm described below.
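
As a concrete instance of such a nodal function, the sketch below uses the logistic sigmoid. The source does not state which sigmoid is intended, so the logistic form (and the function names) should be read as an assumption; its derivative has the closed form s(1 - s), which is what makes it convenient for gradient learning.

```python
import math


def sigmoid(y: float) -> float:
    """Logistic sigmoid: continuous, monotonically increasing, asymptotes 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))


def sigmoid_derivative(y: float) -> float:
    """Derivative of the logistic sigmoid, expressed through its own output."""
    s = sigmoid(y)
    return s * (1.0 - s)
```

The derivative is largest at y = 0 and decays toward zero in both tails, so nodes driven far into saturation contribute very little to a gradient-based weight update.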

A typical sigmoidal backpropagation network is shown in Figure 2.5. This
network architecture is generally subdivided into three distinct regions or layers: input
layer, hidden layers, and output layer.

Figure 2.5 Typical 2-Input / 1-Output Feedforward Network with Three Hidden Layers

The first region, the input layer, is characterized by nodes that act as an interface between
network inputs and the subsequent hidden layer by simply passing the input value to a set
of nodes in the first hidden layer (although weighting is sometimes added to the signal).
Moreover, there is the same number of nodes in the input layer as there are inputs, and each
input layer node typically passes its value to each node in the subsequent layer. The
second region contains the hidden nodal layers. In this region, the weighted sum of the
outputs from the previous layer is used as the input to each sigmoid function to compute the
output of the node. The output is subsequently passed to a following hidden or output
layer. The final region is the output layer, which contains the same number of output
nodes as there are network outputs. The function of the output layer is to compute the
weighted sum of its inputs and pass this value, or the value of a sigmoid function evaluated
at this weighted sum, as the network output. Typically, the number of processing nodes in
backpropagation networks is large compared to the number of different kinds of nodal
functions used in the network, with networks using a single nodal function being the most
common.

Although the selection of the network architecture is significant, the performance of
connectionist networks is ultimately determined by the ability of the learning law to find the
optimal weights. For backpropagation networks, weights are adjusted using an "error
backpropagation" algorithm (Rumelhart, et al. (1986)). Whereas the learning laws of early
connectionist networks had difficulties in properly adjusting connecting layer weights, the
error backpropagation algorithm provides a systematic method to adjust weights in all
adaptable layers. The basic error backpropagation algorithm uses a supervised gradient
descent method to incrementally adjust the weights in the negative direction of the gradient
(with respect to the weights) of a cost function. The general form of the gradient rule is
shown in Equation (2.1) below:

$$\Delta w = -\alpha \frac{\partial J}{\partial w} \qquad (2.1)$$
where $w$ is a vector whose elements are the input weights, $\alpha$ is the learning rate (i.e., the
step size), and $J$ is the cost function to be minimized. The most commonly used cost
function is a quadratic function of the error between the network output
and some desired output. In many supervised learning applications, the network is trained
on a finite number of (known) input / output sample points. In this case, known as
batch-mode training, the quadratic error cost function takes the following form:

$$J = \frac{1}{2} \sum_{i=1}^{n} \left[ d_i - f_{net}(x_i, w) \right]^T \left[ d_i - f_{net}(x_i, w) \right] \qquad (2.2)$$

Here, $n$ is the number of training examples, $x_i$ is the network input for the $i$-th training
sample, $d_i$ is the desired output at the $i$-th training sample, and $f_{net}(x_i, w)$ is the actual
output of the network for the given input and weights. Using this technique, the weights
are adjusted once per pass, or epoch, through all the training examples. Recalling that
the output of a layer is a function of the output of the previous layers, the partial derivatives
of the cost function with respect to an individual weight can be found by forming a chain
rule of partial derivatives and working backward along the same connections as the original
forward path. Since the sigmoid is continuously differentiable, the partials always exist.
Moreover, propagation of the errors backward during the learning stage requires essentially the
same amount of computation time as the forward calculation of the network output.
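
The chain-rule computation described above can be sketched for a very small network. Everything specific here is an assumption made for illustration: a 2-input network with one hidden layer of two logistic-sigmoid nodes and a linear output node, a quadratic batch cost scaled by one half, and the function and variable names.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]


def sigmoid(y: float) -> float:
    return 1.0 / (1.0 + math.exp(-y))


def forward(w_hidden, w_out, x):
    """w_hidden[j] = [w_j1, w_j2, bias_j]; w_out = [v_1, v_2, bias]."""
    hidden = [sigmoid(wj[0] * x[0] + wj[1] * x[1] + wj[2]) for wj in w_hidden]
    output = sum(v * h for v, h in zip(w_out, hidden)) + w_out[-1]
    return hidden, output


def batch_cost(w_hidden, w_out, data: Sequence[Sample]) -> float:
    """Quadratic error summed over all training samples (batch mode)."""
    return 0.5 * sum((d - forward(w_hidden, w_out, x)[1]) ** 2 for x, d in data)


def backprop_gradient(w_hidden, w_out, data: Sequence[Sample]):
    """Partial derivatives of the batch cost, found by working backward."""
    g_hidden = [[0.0, 0.0, 0.0] for _ in w_hidden]
    g_out = [0.0] * len(w_out)
    for x, d in data:
        hidden, output = forward(w_hidden, w_out, x)
        delta_out = -(d - output)                 # dJ/d(output)
        for j, h in enumerate(hidden):
            g_out[j] += delta_out * h
            # chain rule through the output weight and the sigmoid slope h(1-h)
            delta_h = delta_out * w_out[j] * h * (1.0 - h)
            g_hidden[j][0] += delta_h * x[0]
            g_hidden[j][1] += delta_h * x[1]
            g_hidden[j][2] += delta_h
        g_out[-1] += delta_out
    return g_hidden, g_out


def gradient_step(w_hidden, w_out, data: Sequence[Sample], rate: float):
    """One batch update: move each weight against its partial derivative."""
    g_hidden, g_out = backprop_gradient(w_hidden, w_out, data)
    new_hidden = [[w - rate * g for w, g in zip(wj, gj)]
                  for wj, gj in zip(w_hidden, g_hidden)]
    new_out = [v - rate * g for v, g in zip(w_out, g_out)]
    return new_hidden, new_out
```

Note that `backprop_gradient` performs one forward pass and one backward sweep per sample, which is consistent with the remark that the backward calculation costs about as much as the forward one.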

As with all gradient descent methods, the presence of local minima prevents any
guarantees being placed on the ability of the learning algorithm to converge to the optimal
solution. Moreover, simple gradient descent algorithms tend to converge slowly,
especially if there are "troughs" in the error surface (Baird (1991)). Since the goal of
learning is to follow the gradient in a downhill direction, a small learning rate results in
slow convergence. If the learning rate is too high, the weight vector may completely
bypass the trough to some possibly suboptimal plateau or oscillate across the bottom of the
trough with little movement in the direction of the minimum.
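
The effect of the learning rate in a trough can be seen on a simple quadratic error surface. The surface used below (curvature 1 along the trough and 100 across it), the starting point, and the three learning rates are hypothetical values chosen only to make the behavior visible.

```python
from typing import List, Sequence


def cost(w: Sequence[float], curvatures: Sequence[float] = (1.0, 100.0)) -> float:
    """Elongated quadratic bowl: shallow along w[0], steep across w[1]."""
    return 0.5 * sum(c * wi * wi for c, wi in zip(curvatures, w))


def descend(
    rate: float,
    steps: int = 100,
    start: Sequence[float] = (10.0, 1.0),
    curvatures: Sequence[float] = (1.0, 100.0),
) -> List[float]:
    """Plain gradient descent: w <- w - rate * dJ/dw."""
    w = list(start)
    for _ in range(steps):
        grad = [c * wi for c, wi in zip(curvatures, w)]
        w = [wi - rate * g for wi, g in zip(w, grad)]
    return w
```

With these assumed values, `descend(0.001)` is stable but has barely moved along the trough after 100 steps, `descend(0.019)` flips the sign of the cross-trough coordinate on every step while creeping along the bottom, and `descend(0.021)` diverges because the step overshoots the steep direction.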

If an acceptable learning rate is used, or if one of several techniques for speeding up
convergence is applied (e.g., adding momentum terms to the weight update equation
(Rumelhart, et al. (1986)) or using second order derivative information on the cost (Jacobs
( ))), backpropagation networks have shown the ability to adequately map highly
nonlinear functions. In fact, sigmoidal backpropagation networks with more than one
hidden layer can represent any function to a desired degree of accuracy given enough nodes
and training samples (Funahashi (1988), Hornik, et al. (1989)). This universal function
approximation property has played a major role in the resurgence of the sigmoidal
backpropagation network in applications ranging from pattern recognition to automatic
control. However, one should recall that due to the presence of local minima in $J$, there is
no guarantee that a given learning rule will actually yield the weights that represent the
desired mapping.

Many variants of connectionist networks have been developed in an attempt at
improved learning. However, the majority of these systems have one common characteristic:
learning is essentially a process of functional approximation, where inputs and desired
outputs are synthesized to form a multivariable, nonlinear mapping. The type of learning
system used and its associated details are dependent on the specific application. Section
3.2 presents one such specialized approach that is used for the learning augmented control
of a high performance aircraft.

2.3.4 Connectionist Learning Systems for Control

Due to their ability to approximate smooth multivariable, nonlinear functions,
connectionist learning systems have generated a large amount of interest among control
engineers. However, a single, systematic approach for the application of connectionist
learning systems to control has not yet materialized. This section briefly introduces a small
subset of commonly used approaches for learning control, and lists references where
further discussion may be found.

Copying an existing controller is perhaps one of the simplest techniques in
applying connectionist learning systems to control. Assuming there exists a controller that
is able to control the plant, the objective of the connectionist learning system is to
synthesize the mapping between the inputs and the desired output supplied by the existing
controller. Using this approach, the learning system can replace an existing controller in
situations where the existing controller is impractical (e.g., where it is dangerous for a
human to control the plant) or where the learning system offers a less costly representation.
This approach was successfully applied to a pole balancing problem by Widrow & Smith
(1964), where the existing control law was supplied by a human.

Direct inverse control is another method of applying a connectionist learning
system to control (Werbos (1989)). Using this approach, the objective of the learning
system is to identify the plant inverse. This is accomplished by providing the output of the
plant as the network input and the input to the plant (i.e., control signals) as the desired
network output. If the plant has an inverse (i.e., if there is a unique plant input that
produces a unique plant output), then when the desired plant output is provided as input to
the network, the resulting network output is the control to be used as input to the plant
(Barto (1989)). The drawbacks to this technique are that a desired reference trajectory must
be known in order to supply the network with the desired plant output and the inverse of
the plant must be well-defined (i.e., a 1-to-1 mapping between inputs and outputs must
exist).

In the backpropagation through time method developed by Jordan (1988), two
connectionist learning systems are used. The objective of the first network is to identify the
plant, from which one can efficiently compute the derivative of the model output with
respect to its input by means of backpropagation. Subsequently, propagating errors
between actual and desired plant outputs back through this network produces an error in the
control signal, which can be used to train the second network (Barto (1989)). This
approach offers an improvement over direct inverse control since it is able to accommodate
systems with ill-defined inverses, although the desired trajectory must still be known.

Another approach for incorporating learning into a control system is to augment an
adaptive technique with a learning system to form a hybrid controller (Baker & Farrell
(1990), Baird (1991)). Augmentation of the adaptive technique may be implemented using
a direct or indirect approach. Using a direct approach, the learning system generates a
control action (or set of control parameters) associated with a particular operating condition.
This control action is then combined with a control action produced by the adaptive system
to arrive at the control that is applied to the plant. In contrast, for the indirect approach, the
objective of the learning system is to improve the model of the plant. Here, the learning
system generates model parameters that are a function of the operating condition. The
learned model parameters are combined with adaptive estimates to arrive at a model of the
plant. Given a presumably improved plant model, an on-line control law design is used to
form the closed-loop system. A particular indirect learning augmented approach is used in
this thesis and is developed in Chapter 3.

Reinforcement learning has also been suggested as a method of applying
connectionist learning systems for control (Mendel & McLaren (1970), Barto (1989),
Millington (1991)). The major difference between reinforcement learning and the
previously discussed approaches is that under reinforcement learning, the objective is to
optimize the overall behavior of the plant, so that no explicit reference / desired trajectory is
required. As a result, reinforcement learning essentially involves two problems: the
construction of a critic that is capable of evaluating plant performance in a manner that is
consistent with the actual control objective, and the determination of how to alter controller
outputs to improve performance as measured by the critic (Barto (1989)). The latter of the
two problems can be addressed by one of the previously discussed techniques.

3 TECHNICAL APPROACH


As discussed in Chapter 2, control law design for high performance aircraft
presents challenging and unique problems to the designer. Traditional techniques have
proven to be either prohibitively costly in terms of the effort required and the
complexity of developing a multi-region, linearized gain scheduling design, or have simply
sacrificed performance for ease of design. This chapter formally presents an effective
method of integrating an adaptive component with a learning component to form a new
hybrid control law. The hybrid system is presented by introducing each component
separately and then combining the components in a synergistic arrangement to form a
superior flight control system.

3.1 ADAPTIVE CONTROL COMPONENT


Numerous adaptive control techniques have been developed for nonlinear systems
with unmodeled dynamics or model uncertainty (Astrom & Wittenmark (1989), Slotine &
Li (1991)). One major class of adaptive control, model reference adaptive control
(MRAC), is considered in this thesis. The majority of these methods can be grouped into two
general categories, namely, direct and indirect adaptive control. Direct adaptive control
approaches are characterized by the synthesis of control signals directly from observed
plant behavior without the benefit of an explicit plant model. In contrast, the indirect
adaptive control methods rely heavily on an explicit plant model. A control law for an
indirect technique employs a local plant model that is updated from observed plant
behavior. Although developing and periodically updating a plant model is not without its
own costs, indirect techniques have the advantage that many different control design
techniques that are based on explicit plant models can be used. In either case, the adaptive
control system reacts to differences between desired and predicted behavior by adjusting
internal parameters to achieve desired closed-loop performance. These differences in
behavior are typically attributed to nonlinearities, unmodeled dynamics, model uncertainty,
or exogenous disturbances.

Although conventional adaptive control methods have the ability to stabilize and
control some nonlinear systems, the closed-loop system is often unable to match the
performance of a well-designed and well-tuned gain scheduled controller. This difference
in performance stems from inherent time-delays or lags associated with adaptive
controllers. Typically, the process of updating parameters of an adaptive control law
requires several control intervals to accurately detect and compensate for variations in the
plant behavior. Sensor noise exacerbates this dilemma since the required filtering creates
additional lag. Adaptive control approaches also have performance limitations when
presented with quasi-static state dependent disturbances. In particular, since adaptive
controllers are reactive by nature, they are unable to learn and subsequently predict state
dependent behavior (Baker & Farrell (1992)). Even if the plant repeatedly experiences the
same disturbance at a particular location in the state space, the adaptive controller must wait
until the effects of the discrepancy are observed before it can initiate changes in the
parameters. Hence, adaptive controllers fail to make complete use of experientially gained
knowledge. As will be discussed in a following section, this inadequacy of adaptive
control can be overcome with the addition of a learning component.

The primary role of the adaptive control component in the hybrid system is to
accommodate unmodeled dynamical behavior (i.e., behavior that is not expected based on
the design model). Additionally, the adaptive component of the hybrid system has the
auxiliary task of presenting estimates of any observed unmodeled state dependent dynamic
behavior to the learning component (i.e., unknown dynamics that are a function of state in
areas of the state space where the learned mapping can be improved).

3.1.1 Control Law Derivation

Consistent with the above discussion, any adaptive control approach that is
applicable to nonlinear dynamic systems with model uncertainty and that develops estimates
of unknown state dependent components of the plant dynamics is a candidate for the
adaptive component in the hybrid control system. Adaptive techniques that require small
amounts of on-line computation are especially appealing since extra computing power will
be required to train the learning component. One such adaptive control technique is based
on Time Delay Control (TDC). Developed by Youcef-Toumi (1990), TDC is an indirect
adaptive technique designed for the class of systems with discrete nonlinear dynamics
represented in the following form:

$$x(k+1) = g\{x(k),k\} + h\{x(k),k\} + \Gamma u(k) + d(k) \qquad (3.1)$$

Notationally, $g$ and $h$ represent known (modeled) and unknown nonlinear plant dynamics
vectors, respectively. These vectors are functions of the state $x$ and discrete time $k$.
Furthermore, $d$ is a possibly time-varying disturbance vector. The control
vector is represented by $u$ and acts linearly through the control input matrix $\Gamma$ on the new
state. It will be assumed for the moment (in this section only) that $\Gamma$ is known without error.
A separate section addresses the issue of uncertainty in $\Gamma$.
known dynamics, the estimated unknown dynamics, and the estimated disturbances.
Desired state dynamics can then simply be "inserted" along with a proportional error term
to achieve desired tracking error dynamics.

Critical to the TDC control law is the method of obtaining the estimates of the
unknown dynamics and unexpected disturbances. By employing information from the
previous time step, TDC is able to react rapidly to changes in the dynamical behavior of the
plant. This characteristic is ideal for systems that operate in an environment with large
variations in the unknown dynamics and unexpected disturbances. However, this
beneficial feature is not without some cost. Since TDC basically "differentiates" the state in
arriving at the control action, any sensor noise affecting the observed values of the state and
control will be amplified, resulting in noisy control signals and possible rate or position
saturation of the actuators. This effect translates into poor performance and possibly
instabilities. To counter the effects of noise, filters are used. Although filters can
accommodate noise, they add lag, which reduces the performance of the adaptive
system.

Development  of  TDC 

The full development of the TDC control law is contained in Youcef-Toumi &
Osamu (1990). For the sake of completeness, the fundamental equations are summarized
below.

Assume that the plant can be written in the following form:

$$x(k+1) = \Phi x(k) + h\{x(k),k\} + \Gamma u(k) + d(k) \qquad (3.2)$$

where $x$ is an $n$ dimensional state vector, $u$ is an $m$ dimensional vector of control inputs, $\Phi$
is an $n$ by $n$ state transition matrix, $\Gamma$ is an $n$ by $m$ control weighting matrix, and $h$ and $d$
are $n$ dimensional unknown state dynamics and disturbance vectors respectively. Notice
that Equation (3.2) is a special case of Equation (3.1), since the current state acts linearly
in the new state. Here, $\Phi x(k)$ can be viewed as the best time invariant linear
approximation of the known function $g\{x(k),k\}$, linearized about a selected operating
condition. This assumption essentially shifts plant nonlinearities and time dependencies to
the unknown dynamics term $h$.

Define a desired $n$ dimensional reference trajectory $x_m$ to be the following linear,
time-invariant system:

$$x_m(k+1) = \Phi_m x_m(k) + \Gamma_m r(k) \qquad (3.3)$$

where $\Phi_m$ is an $n$ by $n$ reference state transition matrix, $\Gamma_m$ is an $n$ by $m$ reference model
command weighting matrix, and $r(k)$ is an $m$ dimensional vector of reference commands.
There is no requirement that the reference model be a linear, time-invariant system.
Moreover, it is assumed that the reference command $r(k)$ is constrained in a way that the
reference trajectory is achievable by the system described by Equation (3.2).

The difference between the desired reference state and plant state is the error vector:

$$e(k) = x_m(k) - x(k) \qquad (3.4)$$

The control objective of TDC is to force this error vector to zero with the following desired
error dynamics defined in terms of an error dynamics transition matrix $\Phi_e$:

$$e(k+1) = \Phi_e e(k) \qquad (3.5)$$

By expressing $\Phi_e$ in terms of $\Phi_m$, the error dynamics can be written as

$$\Phi_e = \Phi_m + K \qquad (3.6)$$

where $K$ can be viewed as an error feedback matrix.

The control signal $u$ that yields the desired error dynamics is obtained by
incrementing Equation (3.4) one time step forward and substituting Equations (3.2)
through (3.6) as follows:

$$\begin{aligned}
e(k+1) &= x_m(k+1) - x(k+1) \\
\Phi_e e(k) &= \Phi_m x_m(k) + \Gamma_m r(k) - \Phi x(k) - h\{x(k),k\} - \Gamma u(k) - d(k) \\
\Gamma u(k) &= \Phi_m x_m(k) + \Gamma_m r(k) - \Phi x(k) - h\{x(k),k\} - d(k) - \Phi_e e(k) \\
\Gamma u(k) &= [\Phi_m - \Phi] x(k) + \Gamma_m r(k) - h\{x(k),k\} - d(k) - K e(k)
\end{aligned} \qquad (3.7)$$

Notice that the terms $h$ and $d$ on the right hand side of Equation (3.7) are
unknown. They will be replaced by estimates $\hat{h}$ and $\hat{d}$. In particular, if $h$ and $d$ change
relatively slowly, then their estimated value can be obtained by solving Equation (3.2) at
the previous time step, yielding an estimate of the sum of the two terms $h$ and $d$:

$$\begin{aligned}
\hat{h}\{x(k),k\} + \hat{d}(k) &= h\{x(k-1),k-1\} + d(k-1) \\
&= x(k) - \Phi x(k-1) - \Gamma u(k-1)
\end{aligned} \qquad (3.8)$$

Here we assume full knowledge of the state and control values $x(k)$, $x(k-1)$, and $u(k-1)$.

Unless $\Gamma^{-1}$ exists, which implies that $n = m$ so that the number of inputs equals the
number of states, Equation (3.7) will not have a general, exact solution. Nevertheless, an
approximate solution can be generated as follows:

$$u(k) = \Gamma^{+} \left[ [\Phi_m - \Phi] x(k) + \Gamma_m r(k) - \hat{h}\{x(k),k\} - \hat{d}(k) - K e(k) \right] \qquad (3.9)$$

where $\Gamma^{+}$ is the pseudo-inverse of $\Gamma$. The use of the pseudo-inverse of the control
weighting matrix is necessitated by the fact that the majority of control systems have more
states than controls. The following pseudo-inverse

$$\Gamma^{+} = (\Gamma^T \Gamma)^{-1} \Gamma^T \qquad (3.10)$$

results in the minimization of the $L_2$ norm $\| \Gamma \Gamma^{+} - I \|_2$.


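Substituting Equation (3.8) into Equation (3.9) results in the TDC control law:

$$\begin{aligned}
u(k) = {} & -\Gamma^{+} K e(k) && \text{(error feedback)} \\
& + \Gamma^{+} [\Phi_m - \Phi] x(k) && \text{(state feedback)} \\
& + \Gamma^{+} \Gamma_m r(k) && \text{(command feedforward)} \\
& - \Gamma^{+} [x(k) - \Phi x(k-1) - \Gamma u(k-1)] && \text{(cancellation)}
\end{aligned} \qquad (3.11)$$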


The first term in Equation (3.11), error feedback, represents proportional feedback of the
error between the desired and actual state at time $k$. The state feedback term determines
the contribution of the state at discrete time $k$ to the control. This term is a function of the
difference between the desired trajectory dynamics and the linearized approximation of the
plant dynamics. Commands enter the control law through the command feedforward
term. As compared to the feedback terms, the command term is feedforward in the sense
that it is an open-loop term that is not a function of plant state. The cancellation term
attempts to cancel the unknown dynamics and disturbances at the present time $k$ by using
approximations based on observed behavior at the previous time $k-1$.
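
The interplay of these terms can be illustrated with a scalar simulation (n = m = 1, so the pseudo-inverse reduces to an ordinary reciprocal). This is only a sketch under stated assumptions: the plant and reference model coefficients, the unknown dynamics `0.1*sin(x) + 0.05`, the constant command, and the `cancel` switch are hypothetical values introduced for the example and do not come from the source.

```python
import math
from typing import List


def simulate_tdc(steps: int = 60, cancel: bool = True) -> List[float]:
    """Scalar TDC loop; returns the tracking error e(k) = x_m(k) - x(k)."""
    phi, gamma = 0.9, 0.5        # assumed linear design model
    phi_m, gamma_m = 0.8, 0.2    # assumed reference model
    phi_e = 0.4                  # assumed desired error dynamics
    k_gain = phi_e - phi_m       # error feedback gain, K = Phi_e - Phi_m
    gamma_plus = 1.0 / gamma     # scalar case: pseudo-inverse equals inverse

    def unknown(x: float) -> float:
        # h + d: state dependent dynamics and disturbance hidden from the controller
        return 0.1 * math.sin(x) + 0.05

    x, x_m, r = 0.0, 0.0, 1.0
    x_prev = u_prev = None
    errors = []
    for _ in range(steps):
        e = x_m - x
        errors.append(e)
        if cancel and x_prev is not None:
            # estimate of h + d from the previous step
            hd_est = x - phi * x_prev - gamma * u_prev
        else:
            hd_est = 0.0
        u = gamma_plus * ((phi_m - phi) * x + gamma_m * r - hd_est - k_gain * e)
        x_prev, u_prev = x, u
        x = phi * x + unknown(x) + gamma * u   # true plant update
        x_m = phi_m * x_m + gamma_m * r        # reference model update
    return errors
```

With the cancellation term active, the tracking error decays toward zero as the one-step-old estimate converges to the true unknown dynamics; calling `simulate_tdc(cancel=False)` leaves a persistent offset, since the remaining terms have no knowledge of the unmodeled behavior.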

3.1.2  Implementation  Issues 

The design parameters of the TDC control law include those associated with the
reference model dynamics $\{\Phi_m, \Gamma_m\}$ and desired tracking error dynamics $\Phi_e$ (or
equivalently the error feedback matrix $K = \Phi_e - \Phi_m$). Of course, these parameters cannot
be selected in an arbitrary manner. As alluded to in the previous section, TDC requires the
use of a pseudo-inverse in the control law calculation due to the fact that the majority of
systems have more states than controls. Hence, the control weighting matrix is not square
and cannot be exactly inverted. Inserting Equation (3.11) into (3.2) shows that the following
constraint must be met in order for the plant state to track the model state with the desired
error dynamics:

$$\{I - \Gamma \Gamma^{+}\} \{ [\Phi_m - \Phi] x(k) + \Gamma_m r(k) - h\{x(k),k\} - d(k) - K e(k) \} = 0 \qquad (3.12)$$

Notice that if $\Gamma$ is square and invertible, then the first factor on the left hand side
is zero, which guarantees that the constraint is always met. If this is not the case, then values for the
design parameters $\Phi_m$, $\Gamma_m$, and $K$ must be selected to minimize the error of Equation
(3.12) for arbitrary $r$, $h$, and $d$. Alternatively, $\Gamma^{+}$ can be selected so that the nonzero
second factor on the left hand side of Equation (3.12) is in the nullspace of $\{I - \Gamma \Gamma^{+}\}$.
However, the approaches for meeting the constraint of Equation (3.12) when $\Gamma$ is
non-square are generally difficult.
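
The role of the first factor in the constraint can be made concrete for a single-input system, where the control weighting matrix is one column. The sketch below assumes the common left pseudo-inverse, which for a single column reduces to the transposed vector divided by its squared length; the function names and the two-state example values are hypothetical.

```python
from typing import List, Sequence


def pinv_column(gamma: Sequence[float]) -> List[float]:
    """Left pseudo-inverse of a single-column Gamma: Gamma^T / (Gamma^T Gamma)."""
    norm_sq = sum(g * g for g in gamma)
    if norm_sq == 0.0:
        raise ValueError("Gamma must be nonzero")
    return [g / norm_sq for g in gamma]


def constraint_residual(gamma: Sequence[float], v: Sequence[float]) -> List[float]:
    """Return {I - Gamma Gamma^+} v, the part of v that no control can produce."""
    g_plus = pinv_column(gamma)
    u = sum(gp * vi for gp, vi in zip(g_plus, v))   # u = Gamma^+ v
    return [vi - g * u for g, vi in zip(gamma, v)]
```

For example, with `gamma = [0.0, 1.0]` (the control enters only the second state), a demanded vector `[0.0, 3.0]` gives a zero residual, whereas `[2.0, 3.0]` leaves `[2.0, 0.0]`: the first component lies outside the range of the control weighting matrix, so the constraint cannot be met for that demand.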

Beyond this constraint issue, the error feedback matrix $K$ is chosen to achieve the
desired error dynamics. Typically, error dynamics have been chosen as a function of
the reference model dynamics (e.g., twice as fast). However, other selections can be
accommodated as long as the error dynamics are stable.

Selection of a desired reference model $\{\Phi_m, \Gamma_m\}$ is frequently application specific.
Although there is no requirement on the method used to generate a reference model for a
flight control application, typical design specifications are often stated in terms of
characteristics of linear, time-invariant (LTI) systems. For example, military aircraft must
meet MIL-F-8785C (1980) specifications for natural frequency and damping ratio of their
characteristic modes. Thus, an LTI system is often employed in the role of a reference
model. The reference model for the aircraft control problem addressed by this thesis is
discussed in Section 4.3.3.

3.2 LEARNING CONTROL SYSTEM

The purpose of the learning system in the hybrid control law is to synthesize a
mapping between the state and controls of the plant and an estimate of the unknown
dynamics $h$ generated by an adaptive component. As discussed previously, connectionist
networks have demonstrated an ability to learn highly nonlinear, multivariable mappings.
In this section, the complete development of the learning system employed in the hybrid
controller is presented.

3.2.1  Incremental Learning and Fixation

Since the objective of the network in control applications is to synthesize a mapping
over a continuous input space, the training cost function in Equation (2.2) cannot be used
directly (i.e., the number of training examples is not a finite set). As a result, one common
approach for systems with a continuous input space is to use incremental learning (Baker
& Farrell; Rumelhart, et al. (1986)). Incremental learning algorithms seek to
reduce a cost function defined in terms of the current input point rather than a cost function
defined over a fixed set of samples as in Equation (2.2). Using this approach, the cost
function defined in Equation (2.2) reduces to the single term

$$ J = \tfrac{1}{2}\,[d(x) - f_{net}(x,p)]^{T}\,[d(x) - f_{net}(x,p)] \qquad (3.13) $$

where d(x) is the desired network output at current state x and f_net(x, p) is the output of
the network as a function of state and parameters p. The general parameter vector p is used
in Equation (3.13) to allow for nodal functions that are not simply a function of the state
and weights. An incremental learning approach essentially provides a convenient, point-wise
contribution to an aggregate cost function similar to Equation (2.2) since it can be
computed quickly and efficiently.

During incremental learning, care must be taken to ensure that samples are
sufficiently distributed throughout the input space so that over a finite period of time, the
individual point-wise contributions of Equation (3.13) collectively provide an
approximation to the batch-type cost function in Equation (2.2). Since parameters are
updated at each sample, the network reacts to mapping errors at the current input.
Unfortunately, sigmoidal networks possess a relatively high degree of "generalization,"
where parameter changes impact the network mapping over potentially large regions of the
input space. As a consequence, the localized nature of incremental learning can result in
"fixation" of the network, where the network attempts to achieve an accurate mapping at the
current state, while potentially degrading an acceptable mapping already learned in other
regions of the input space (Baker & Farrell (1992)).

The magnitude of the fixation problem is determined by the rate of mapping
degradation in outlying regions relative to the time required to receive samples from all
regions of input space. This rate of outlying mapping degradation is in turn determined by
the degree of generalization and the learning rate. A network with a high level of
generalization requires rapid and extensive distribution of sampling points or a very small
learning rate to avoid problems associated with fixation. For control problems, the former
is generally not possible since the sampling process is constrained by the system dynamics.
Furthermore, extensive investigation of the state space is typically inconsistent with the
control objectives. This point is most evident in regulation, where the goal is to keep the
system near some operating point. Due to such constraints on the sampling process, an
alternative approach to avoiding fixation during incremental training is to reduce the degree
of global generalization in the network. Such spatially localized networks are discussed in
the next section.

3.2.2  Spatially Localized Learning

The basic idea of spatially localized learning is that experience (learning) in a local
region of the input space should only affect the mapping in that particular locality, with a
marginal effect in all other areas. Spatially localized learning prevents knowledge that has
already been collected in other regions of the mapping from becoming incorrectly perturbed
(i.e., corrupted). This is accomplished by lessening the extent of generalization to include
only a local region. Figure 3.1 illustrates the concept of spatially localized learning. Let
the mapping to be learned be a function of the input vector x and of a set of parameters
to be learned that define the mapping.


Figure 3.1 Spatially Localized Learning: the Ideal Case

Figure 3.1 shows a region of the domain of the function to be learned being mapped to
an associated region of the range. The ideal situation for localized learning, as indicated
in the figure, is that this local mapping be a function of only a subset of the parameter set.
Thus, learning based on samples in that region of the domain will only cause that subset of
parameters to change. Of course, this represents the ideal situation which is not
practical for a variety of reasons. However, the objective is that the result of learning from
input samples in one region of the domain should not significantly alter previously learned
mappings in distant regions of the domain.
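The contrast between global and local generalization can be illustrated numerically. The following is a minimal sketch, not the networks used in this report: a three-term polynomial stands in for a globally generalizing approximator (it is an assumed substitute for a sigmoidal network), and five normalized Gaussian features stand in for a spatially localized one. Both are trained incrementally at a single input, and the change of the output at a distant input is measured. All names, centers, widths, and rates are hypothetical.

```python
import numpy as np

def global_features(x):
    # Polynomial features: every parameter influences the whole input range.
    return np.array([1.0, x, x * x])

CENTERS = np.linspace(-1.0, 1.0, 5)   # hypothetical node centers
WIDTH = 0.25                          # hypothetical Gaussian width

def local_features(x):
    # Normalized Gaussian features: only nearby centers respond to x.
    g = np.exp(-((x - CENTERS) / WIDTH) ** 2)
    return g / g.sum()

def drift_far_away(features, x_train=0.9, x_far=-0.9, target=1.0,
                   rate=0.1, steps=200):
    """Train repeatedly at x_train; return how much the output at x_far moves."""
    p = np.zeros(features(x_train).size)
    before = p @ features(x_far)      # the all-zero map stands in for prior knowledge
    for _ in range(steps):
        phi = features(x_train)
        p += rate * (target - p @ phi) * phi   # gradient step on the point-wise cost
    return abs(p @ features(x_far) - before)

drift_global = drift_far_away(global_features)
drift_local = drift_far_away(local_features)
```

In this toy setting the output of the polynomial model at the distant point shifts by roughly a third of the target value, while the localized model is essentially unchanged there, which is the behavior the ideal case in Figure 3.1 is meant to capture.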

This local generalization property of spatially localized learning contrasts with
typical structures (e.g., sigmoidal networks) that are characterized by a much larger, more
global generalization. The following discussion introduces and develops one example of a
spatially localized learning system that is used in the hybrid control system. Learning is
accomplished by an incremental gradient descent learning algorithm.

3.2.3  The Linear-Gaussian Network

One learning system design that exhibits spatially localized properties is the
linear-Gaussian network. The linear-Gaussian network is an example of a local basis / influence
function system (Baker & Farrell (1992), Millington (1991)). The network mapping is
constructed from a set of hyperplanes that act as "basis" functions over a localized region of
the input space. Although many functions could be used as a local basis, hyperplanes offer
an attractive choice for the control problem due to their simple structure and similarity with
conventional gain scheduled mappings. The influence function associated with each local
basis function is an elliptic hyper-Gaussian. As the name suggests, the role of the
influence function is to determine the region of applicability of a particular local basis
function in the input space. For example, a basis function associated with a
hyper-Gaussian whose center is very close to the current input plays a much larger role in the
determination of the output of the mapping than a basis function whose Gaussian is
centered far away from the current input. The following discussion details the
linear-Gaussian network.

Node Descriptions


The local basis function of a linear-Gaussian node is formed by adding the
weighted sum of the inputs with a bias. Equation (3.14) shows the relationship between
the linear basis function L_i and its inputs, x:

$$ L_i(x) = W_i\,(x - x_{i0}) + b_i \qquad (3.14) $$


Here, if n is the number of node inputs and m is the number of node outputs, then L_i is an
m dimensional vector, x is an n dimensional vector of node inputs, W_i is an m×n matrix
whose elements are the weights on the input, x_i0 is an n dimensional vector that represents
the center of the Gaussian nodal function described below, and b_i is an m dimensional bias
vector.


The linear-Gaussian node uses a hyper-Gaussian as an influence function for the
basis L_i in Equation (3.14). The value for the Gaussian function G_i is given by:


$$ G_i(x) = \exp\!\left[-(x - x_{i0})^{T}\,(D_i)^{2}\,(x - x_{i0})\right] \qquad (3.15) $$


where D_i is a diagonal matrix containing values for the spatial decay of the Gaussian, x_i0 is
the Gaussian center, and x is the input vector. Figure 3.2 contains an illustration of a
typical Gaussian function.


Figure 3.2 Gaussian Function

The Gaussian is a continuous function with finite asymptotic values. Moreover, a
Gaussian is differentiable over the entire input space, which is important in any learning
algorithm that relies on partial derivative information for training (e.g., gradient methods).
The output of the linear-Gaussian node is simply the product of the linear basis function
and the Gaussian influence function. The general structure is shown in Figure 3.3


Figure  3.3  Linear  Gaussian  Node 

where Σ_b represents a summing node with bias and Π a multiplication node. By dividing
the Gaussian function by the sum of all the Gaussians, the resulting quotient is the
normalized influence function, Γ_i(x). This relationship is shown in Equation (3.16) below

$$ \Gamma_i(x) = \frac{G_i(x)}{\sum_{j} G_j(x)} \qquad (3.16) $$

where

$$ \sum_{i} \Gamma_i(x) = 1 \quad \text{and} \quad 0 < \Gamma_i(x) < 1 \qquad (3.17) $$

Combining Equations (3.14) through (3.16) yields the following equation for the m
dimensional vector output of a linear-Gaussian network:

$$ Y(x) = \sum_{i} L_i(x)\,\Gamma_i(x) \qquad (3.18) $$

Network Architecture

Linear  Gaussian  networks  use  a  feedforward  architecture  and  consist  of  three  main 
layers:  input,  hidden,  and  output  as  depicted  in  Figure  3.4.  The  first  layer  of  the  network, 
the input layer, simply passes the input values to the subsequent hidden layer. As one
would  expect,  there  are  the  same  number  of  input  nodes  as  there  are  network  inputs.  The 
hidden  layer  is  not  directly  observable  to  the  external  environment.  This  hidden  layer 
contains  two  elements,  namely  the  linear-Gaussian  nodes  and  nodes  that  normalize  the 
Gaussian  influence  function.  By  adding  enough  linear-Gaussian  nodes,  a  single  hidden 
layer  network  can  provide  arbitrarily  accurate  function  approximations.  Furthermore, 
multiple  hidden  layers  of  linear-Gaussian  nodes  lead  to  non-localized  mappings.  For  these 
reasons,  only  linear-Gaussian  networks  with  a  single  hidden  layer  are  investigated.  The 
final  layer  is  the  output  layer.  It  contains  as  many  nodes  as  there  are  outputs.  A  typical 
linear-Gaussian  network  is  shown  in  Figure  3.4 


Figure  3.4  Multi-Input,  Single  Output  Linear  Gaussian  Network 
where the negative sign of the rightmost Π node indicates that the argument is reciprocated
prior to the multiplication.

The linear-Gaussian network uses a supervised, incremental gradient descent
algorithm to adjust the network parameters in the negative direction of the gradient of
the cost function with respect to those parameters:

$$ \Delta p_i = -\alpha\,\frac{\partial J}{\partial p_i}, \qquad \alpha > 0 \qquad (3.19) $$


where p_i is a vector of the adjustable parameters of the node (e.g., weight matrix
elements, bias, spatial decay, or center), J is the cost at a particular training sample, and
α is the learning rate. The typical cost used for linear-Gaussian networks is shown in
Equation (3.20)


$$ J = \tfrac{1}{2}\,[d(x) - f_{net}(x,p)]^{T}\,[d(x) - f_{net}(x,p)] \qquad (3.20) $$


where d is the desired output as a function of input state x, and f_net is the output of the
network as a function of input state and network parameters p. In minimizing the cost at
each step (i.e., for each training sample), all of the parameters, or just a subset, can be
adjusted using Equation (3.19). The local learning rate for each parameter can be adjusted
independently in order to achieve a more rapid convergence.
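The node equations and the incremental update can be sketched together in a few lines. The following is a minimal illustration under stated assumptions, not the implementation used in this work: the class name, array shapes, and learning rate are hypothetical; the Gaussian exponent is assumed to be the quadratic form with the squared diagonal decay entries; and only the weights and biases are adapted (the text allows any subset), with centers and decays held fixed.

```python
import numpy as np

class LinearGaussianNet:
    """Sketch of Equations (3.14)-(3.20); names and shapes are illustrative."""

    def __init__(self, centers, decay, n_out):
        self.c = np.asarray(centers, dtype=float)   # (N, n) Gaussian centers
        self.d = np.asarray(decay, dtype=float)     # (N, n) diagonal decay entries
        N, n = self.c.shape
        self.W = np.zeros((N, n_out, n))            # weight matrices W_i
        self.b = np.zeros((N, n_out))               # bias vectors b_i

    def influence(self, x):
        dx = x - self.c
        G = np.exp(-np.sum((self.d * dx) ** 2, axis=1))   # assumed form of (3.15)
        return G / G.sum()                                 # (3.16)

    def basis(self, x):
        dx = x - self.c
        return np.einsum('imn,in->im', self.W, dx) + self.b   # (3.14)

    def output(self, x):
        return self.influence(x) @ self.basis(x)               # (3.18)

    def cost(self, x, d):
        e = d - self.output(x)
        return 0.5 * e @ e                                     # (3.20)

    def train_step(self, x, d, alpha=0.5):
        """One incremental gradient step (3.19) on W_i and b_i only."""
        gam = self.influence(x)
        e = d - self.output(x)
        dx = x - self.c
        # dJ/db_i = -gam_i * e   and   dJ/dW_i = -gam_i * outer(e, x - center_i)
        self.b += alpha * gam[:, None] * e[None, :]
        self.W += alpha * gam[:, None, None] * e[None, :, None] * dx[:, None, :]

# Hypothetical 3-by-3 grid of nodes over a two-dimensional input space.
grid = np.array([[a, b] for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)])
net = LinearGaussianNet(grid, decay=np.full(grid.shape, 2.0), n_out=1)
x, d = np.array([0.2, -0.3]), np.array([1.0])
cost_before = net.cost(x, d)
net.train_step(x, d)
cost_after = net.cost(x, d)
```

Because the output is linear in the weights and biases, a single step with a modest learning rate reduces the point-wise cost at the training input. A practical version would also guard against all Gaussians underflowing to zero for inputs far from every center.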

Besides the basic nodal and architectural differences, the learning algorithm of the
linear-Gaussian network also differs from that of the classic sigmoidal network. Since the
normalized Gaussian influence functions represent a measure of the significance that each
node has on a particular value of the input x (i.e., the influence of each node on the output
for a given input), it is reasonable to eliminate insignificant nodes from the learning
calculation. Due to the elimination of these nodes, the computational efficiency and hence
the training time of the network are improved. For example, for a given input value x, the
learning algorithm might first order the Gaussian nodes by their normalized influence and
use only enough nodes so that the sum of the normalized influences equals or exceeds
some threshold value (e.g., 99%). Since the remaining nodes have only a minor effect on
the local region, their outputs need not be included in the parameter update. For large
networks, this can result in a substantial reduction in computation.
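The node selection rule described above can be sketched as follows. The function name `significant_nodes` and the example influence values are assumptions made for illustration; the rule itself (sort by normalized influence, keep nodes until the running sum equals or exceeds the threshold) follows the text.

```python
import numpy as np

def significant_nodes(influence, threshold=0.99):
    """Indices of the fewest nodes whose normalized influences sum to >= threshold."""
    order = np.argsort(influence)[::-1]            # most influential first
    cumulative = np.cumsum(influence[order])
    # First position at which the running sum reaches the threshold.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return order[:min(count, influence.size)]

gamma = np.array([0.02, 0.70, 0.005, 0.25, 0.025])   # illustrative values, sum to 1
active = significant_nodes(gamma)
```

Only the parameters of the nodes listed in `active` would then be passed to the update of Equation (3.19); the remaining nodes are skipped for that sample.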

Application Issues


The  number  of  nodes  needed  by  a  linear-Gaussian  network  is  dependent  on  the 
characteristics  of  the  function  it  is  attempting  to  approximate  and  on  any  requirements 
placed  on  the  desired  rate  of  convergence  and  the  level  of  acceptable  errors  in  the  learned 
mapping.  Although  no  set  of  strict  rules  has  been  developed  for  selecting  the  number  of 
nodes,  several  guidelines  do  exist.  For  functions  that  are  very  smooth,  the  mapping  can  be 
realized  with  relatively  few  nodes  spread  evenly  throughout  the  input  space.  A  large 
number  of  nodes  will  not  improve  this  mapping  and  will  only  serve  to  increase  the  network 
training  time.  However,  more  complex  functions  with  large  local  variations  will  require  a 
large  number  of  nodes,  each  with  a  relatively  small  sphere  of  influence. 

The  sphere  of  influence  of  a  Gaussian  function  is  determined  by  the  spatial  decay 
matrix.  Hence,  the  spatial  decay  matrix  is  a  factor  in  determining  the  size  of  the  local 
regions in the input space. If the spatial decay is large, the transitions between the regional
linear basis functions will be more abrupt if the density of basis functions is not high. This
property  is  ideal  for  more  complex  functions.  However,  a  large  spatial  decay  will  require 
many  more  nodes  to  sufficiently  map  the  entire  input  space.  In  contrast,  small  spatial  decay 
rates  result  in  large  regions  of  influence  that  are  ideal  for  smooth,  slowly  varying  functions. 

Initial values for the weights, biases, and Gaussian centers must also be selected.
The basis function described by a weight matrix and bias vector represents a best guess of
the desired mapping based on a priori information. Hence, values for the weights and
biases can be initialized from an existing gain scheduled controller or other linearizable
control law. In cases with considerable a priori knowledge, the adjustable parameters are
presumably much closer to their optimal values, and training time is greatly reduced. If no
a priori knowledge is available about the mapping, the weights and biases are set to zero.
The initialization of the Gaussian centers effectively locates the influence functions in the
input space, and generally, the centers are placed so that the entire input space is spanned.

In  summary,  a  linear-Gaussian  network  is  one  example  of  a  spatially  localized 
learning  system.  This  network  combines  linear  basis  functions  with  Gaussian  functions  to 
provide  the  properties  of  local  learning  and  the  generalization  properties  of  typical 
connectionist networks. Networks that employ spatially localized learning are required for
control  systems  that  regularly  encounter  scenarios  that  might  cause  fixation,  as  described  in 
Subsection  3.2.1.  Although  linear-Gaussian  networks  tend  to  require  more  nodes  and  thus 
more  memory  (due  to  localization),  improved  learning  efficiency  and,  more  importantly, 
better  functional  mappings  can  be  obtained. 

3.3  HYBRID LEARNING / ADAPTIVE CONTROL

The  hybrid  control  law  developed  in  this  section  represents  one  approach  to 
combining  a  learning  system  with  an  adaptive  component  with  the  objective  of  improving 
performance  in  the  presence  of  unmodeled  dynamics  and  model  uncertainty.  In 
augmenting  an  indirect  adaptive  controller  with  a  connectionist  learning  system,  the  general 
goal  is  to  develop  a  control  law  that  combines  the  strengths  of  each  component. 

3.3.1  Hybrid  Control  System  Architecture 

Adaptive control systems are capable of controlling complex dynamic systems.
However, traditional adaptive control techniques only react (after the fact) to differences
between actual and expected behavior; they have no anticipatory capacity. Learning in
connectionist systems is fundamentally a process of function approximation. Hence, given
the vehicle state and the applied control as an input and the unknown dynamics as desired
outputs, a connectionist learning system is capable of realizing a mapping of the state and
control dependent dynamics. Thus, by augmenting an adaptive controller with a learning
system, it is possible to anticipate the state dependent components of the plant dynamics by
"looking up" the values of that component of the dynamics for the current situation from
the network, instead of waiting for an adaptive component to react to observed differences
between the actual and expected state values. By incorporating a learning system into the
control law, the hybrid controller is able to use experientially gained knowledge.
Figure 3.5 illustrates the control system architecture of the hybrid adaptive / learning
controller (Baker & Farrell (1991)).


Figure 3.5 Hybrid System Architecture

The role of each component in the hybrid system is straightforward. The adaptive
component provides an adaptive capability to accommodate unmodeled dynamic behavior
that is not expected (based on the design model). Moreover, it provides a posterior
estimate of any unmodeled state and control dependent behavior which can be used to train
the learning system. The role of the learning system is to anticipate vehicle behavior that
varies predictably with the current state and control.

3.3.2  Learning Versus Adaptation

Since both the adaptive component and the learning component in the hybrid control
system are based on parameter adjustment algorithms that use information gained by
observing the closed-loop behavior of the plant, one might think it is difficult to distinguish
between the two components. However, for the majority of systems, distinguishing
qualities do exist. The following discussion presents the different goals and characteristics
of the adaptive and learning components in an attempt to differentiate the two.

The adaptive component of the hybrid system can be characterized by its reactive
approach to accommodating local disturbances and apparent time-varying dynamics.
Nonlinearities that are a function of the operating condition of the plant appear to the
adaptive component as time-varying dynamics when they are actually changes in the local
linearized behavior of a nonlinear, time-invariant plant. Since adaptive controllers typically
lack the ability to associate the required changes in the control action as a function of the
operating conditions, the controller must continually adapt to all unexpected effects, even
those which are experienced repeatedly and are actually due to time-invariant nonlinearities.
In other words, adaptive controllers have no "memory" and are unable to anticipate
dynamics that are strictly a function of state. Thus, this lack of memory prevents any
anticipatory action by the controller. Moreover, to prevent a situation where the adaptive
controller is continuously in some suboptimal, partially adapted state, the generation of the
unknown dynamics estimate must be relatively fast when compared to the plant dynamics.
In summary, the adaptive component reacts to unexpected effects in order to maintain
locally desired behavior; it is best at accommodating novel situations and slowly
time-varying dynamics.

The reactive characteristics of the adaptive component directly contrast with the
constructional emphasis of the learning component. In particular, the objective of the
learning component in the hybrid control law is to associate initially unknown state
dependent dynamics with the state and control at the current operating condition. The
association is essentially a memory function (or mapping) that stores experientially gained
knowledge. This knowledge of originally unknown dynamics can be exploited by the
hybrid control system as a means of anticipating transient behavior instead of waiting to
react to errors observed in the output. Moreover, this allows the adaptive component to
focus on accommodating slowly varying exogenous (not state dependent) disturbances.
Since the objective of learning is to realize a mapping of state dependencies over the entire
operating envelope, the learning system is characterized by a global optimization and
relatively slow dynamics. Table 3.1 summarizes the major differences between adaptation
and learning (Baker & Farrell (1991)).


Table 3.1 Adaptation vs. Learning

ADAPTATION                              LEARNING
--------------------------------------  -------------------------------------------
reactive: maintain desired behavior     constructional: synthesize desired behavior
(local optimization)                    (global optimization)
temporal emphasis                       spatial emphasis
no "memory" => no anticipation          "memory" => anticipation
fast dynamics                           slow dynamics

The goal of the hybrid controller is to combine the different behavioral
characteristics of the adaptive and learning components in a synergistic fashion. Ideally,
the adaptive controller accommodates local unmodeled dynamics and novel state
dependencies, while the learning system is responsible for reducing state and control
dependent model uncertainty.

3.3.3  Control Law Development

As discussed previously, TDC is one example of a particularly simple indirect
adaptive control approach. Recall that TDC calculates an estimate of the sum of the
unknown dynamics h and disturbances d at the previous time step by examining the
difference in the dynamical behavior between the current state of the plant and the expected
state given the state and control at the previous time step. Assuming that h and d do not
change significantly over a control time step, TDC uses this old value of the sum of h and
d in formulating the control law. By integrating TDC with a learning component to form a
hybrid controller, this delay in the estimated value of h can be eliminated. Although this
delay is possibly insignificant with short control cycle times in the absence of sensor noise,
such is not the case in a more realistic environment. If the control law is generated at a fast
rate, the unknown dynamics and disturbances at the previous time will accurately reflect the
current values (in the absence of noise). However, as the cycle time is increased, the
potential for error in the estimates grows. If sensor noise is present, it is still possible to
predict the current state within a given tolerance. However, since h and d are essentially
found by taking the derivative of the state, sensor noise can have a large impact on these
estimates and subsequently the control generated by TDC.

The most common technique for offsetting the effects of sensor noise is to use a
filter. Note, however, that filtering the noise only adds to the delay already associated with
h and d. For this reason, a hybrid approach can offer significant advantages due to the use
of a connectionist learning system. Since sensor noise can alter the estimates of h and d
significantly, it is possible to have conflicting desired output values for the same input
(over time). Given this contrasting information, connectionist systems tend to learn the
average value. Thus, if the sensor noise is zero mean, which is the assumed case, the
correct mapping will still be realized by the learning system. Since the recall of the learned
estimates of the state dependent dynamics is nearly instantaneous, the hybrid system is
essentially able to remove the delay associated with the adaptive component.
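The averaging behavior described above can be seen in a one-parameter toy problem. This is a sketch under simplifying assumptions, not an experiment from this report: a single stored value is trained by incremental gradient descent toward targets corrupted by zero-mean Gaussian noise. The true value, noise level, learning rate, and seed are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the run is repeatable
true_value = 2.0                    # hypothetical noise-free target at one input
noise_std = 0.5                     # hypothetical zero-mean sensor-noise level
alpha = 0.01                        # learning rate

estimate = 0.0
for _ in range(5000):
    noisy_target = true_value + noise_std * rng.standard_normal()
    # Gradient step on J = 0.5 * (noisy_target - estimate)**2
    estimate += alpha * (noisy_target - estimate)
```

With a small learning rate the stored value settles close to the noise-free target even though individual training targets are scattered widely around it; a larger learning rate would track the noise more closely.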

As  alluded  to  earlier,  the  hybrid  control  law  can  be  derived  by  augmenting  the  TDC 
equations  with  a  learning  component.  Assume  the  nonlinear  plant  can  be  written  in  the 
following  form: 

$$ x(k+1) = \Phi\,x(k) + \Gamma\,u(k) + f_{net}\{x(k),u(k)\} + h\{x(k),u(k),k\} + d(k) \qquad (3.21) $$

where x(k) is an n dimensional state vector at discrete time k, u(k) is an m dimensional
control vector at k, Φ is an n by n state transition matrix, Γ is an n by m control weighting
matrix, h and d are n dimensional unknown dynamics and disturbances vectors
respectively, and f_net is the n dimensional learned component of the state dependent
dynamics. Equation (3.21) differs from the form of the plant model used in the TDC
derivation only by the learned dynamics term f_net. Moreover, the unknown dynamics term
is allowed to be a function of control as well as state, essentially accounting for errors in
the assumed (but unlikely to be) perfectly known Γ.

As  before,  the  desired  reference  trajectory  is  given  by 

$$ x_m(k+1) = \Phi_m\,x_m(k) + \Gamma_m\,r(k) \qquad (3.22) $$

Here x_m(k) represents the n dimensional desired model state vector at time k, Φ_m is the n
by n state transition matrix defined by the linear relationship between the current and next
state, r(k) is the m dimensional command vector and Γ_m is the n by m model command
weighting matrix. As was the case with the derivation of the TDC control law in Section
3.1.1, there is no requirement that the reference model be linear. The only requirement on
the reference model is that the desired trajectory is achievable, otherwise the control law
may saturate the effectors and yield unsatisfactory performance.

If the difference between the desired reference state and plant state at discrete time k
is represented by the error vector

$$ e(k) = x_m(k) - x(k) \qquad (3.23) $$

then  the  control  objective  of  the  hybrid  control  law  is  to  force  this  error  vector  to  zero  with 
the  following  dynamics 

$$ e(k+1) = [\Phi_m + K]\,e(k) = \Phi_e\,e(k) \qquad (3.24) $$

where K is interpreted as the error feedback matrix and Φ_e is the desired error dynamics
matrix.
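One way to read the "twice as fast" guideline for the error dynamics in discrete time is sketched below. Doubling every continuous-time decay rate maps each discrete eigenvalue exp(λT) to exp(2λT), which corresponds to squaring the reference model matrix. This interpretation, the matrix `Phi_m`, and the helper `spectral_radius` are assumptions made for the example and are not taken from the report.

```python
import numpy as np

# Hypothetical stable discrete-time reference model matrix.
Phi_m = np.array([[0.9, 0.1],
                  [0.0, 0.8]])

# "Twice as fast" read as Phi_e = Phi_m @ Phi_m (each eigenvalue is squared).
Phi_e = Phi_m @ Phi_m
K = Phi_e - Phi_m                   # from Phi_e = Phi_m + K in Equation (3.24)

def spectral_radius(M):
    # Largest eigenvalue magnitude; below 1 means the error dynamics are stable.
    return np.max(np.abs(np.linalg.eigvals(M)))

# Propagate an initial error through e(k+1) = Phi_e e(k).
e = np.array([1.0, -1.0])
for _ in range(50):
    e = Phi_e @ e
```

Any other choice of K is acceptable in the sense of the text as long as the spectral radius of Φ_m + K stays below one.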

If Equations (3.21) through (3.23) are substituted into Equation (3.24), the control
signal u that yields the desired error dynamics is obtained from:

$$ \Gamma\,u(k) = [\Phi_m - \Phi]\,x(k) + \Gamma_m\,r(k) - f_{net}\{x(k),u(k)\} - h\{x(k),u(k),k\} - d(k) - K\,e(k) \qquad (3.25) $$

To isolate the functions on the right hand side of Equation (3.25) that are dependent on u at
the current time k, approximations for the unknown dynamics and disturbances as well as
the output of the connectionist network are made.

If  h  and  d  change  relatively  slowly,  then  their  estimated  value  can  be  obtained  (as 
before)  by  solving  Equation  (3.21)  at  the  previous  time  step,  yielding 

$$ \begin{aligned} h\{x(k),u(k),k\} + d(k) &\approx h\{x(k-1),u(k-1),k-1\} + d(k-1) \\ &= x(k) - \Phi\,x(k-1) - \Gamma\,u(k-1) - f_{net}\{x(k-1),u(k-1)\} \end{aligned} \qquad (3.26) $$
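The delayed estimate in Equation (3.26) amounts to a single residual computation. The sketch below assumes the learned dynamics are available as a callable `f_net(x, u)`; that interface, the function name `estimate_unknown`, and all numerical values are hypothetical and serve only to check the algebra on a synthetic plant whose unmodeled term is constant.

```python
import numpy as np

def estimate_unknown(x_now, x_prev, u_prev, Phi, Gamma, f_net):
    """Delayed estimate of h + d from Equation (3.26)."""
    return x_now - Phi @ x_prev - Gamma @ u_prev - f_net(x_prev, u_prev)

# Synthetic check with made-up numbers.
Phi = np.array([[1.0, 0.1], [0.0, 1.0]])
Gamma = np.array([[0.0], [0.1]])
f_net = lambda x, u: np.array([0.0, 0.05 * np.sin(x[0])])
h_plus_d = np.array([0.02, -0.03])

x_prev, u_prev = np.array([0.5, -0.2]), np.array([1.0])
# One step of the plant model (3.21) with the constant unmodeled term.
x_now = Phi @ x_prev + Gamma @ u_prev + f_net(x_prev, u_prev) + h_plus_d
estimate = estimate_unknown(x_now, x_prev, u_prev, Phi, Gamma, f_net)
```

With noise-free state measurements the residual recovers the unmodeled term exactly; with noisy measurements the same expression passes the noise straight into the estimate, which is the sensitivity noted earlier in this section.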


The network function f_net in Equation (3.25) can be approximated using the first-order
Taylor series expansion shown in Equation (3.27) to isolate the linear dependence on u at
time k. Since the network is continuously differentiable, the Jacobian in Equation (3.27) is
known to exist. Moreover, this Jacobian information is already calculated since it is needed
for the learning algorithm discussed in Section 3.2.

$$ f_{net}\{x(k),u(k)\} \approx f_{net}\{x(k),u(k-1)\} + \left.\frac{\partial f_{net}}{\partial u}\right|_{x(k),\,u(k-1)} \big(u(k) - u(k-1)\big) \qquad (3.27) $$

Substituting these approximations into Equation (3.25) and solving for u at the
current time k (using a pseudo-inverse) yields the following hybrid control law:

$$ u(k) = \left[\Gamma + \frac{\partial f_{net}}{\partial u}\right]^{+} \Big\{ [\Phi_m - \Phi]\,x(k) + \Gamma_m\,r(k) - K\,e(k) - \big[h\{x(k-1),u(k-1),k-1\} + d(k-1)\big] - f_{net}\{x(k),u(k-1)\} + \frac{\partial f_{net}}{\partial u}\,u(k-1) \Big\} \qquad (3.28) $$


The differences between the TDC control law in Equation (3.11) and the hybrid control law
are the result of the added learning terms. The fifth term in Equation (3.28) represents the
learned state dependent dynamics. The partial derivative of the network output with respect
to the control used in the pseudo-inverse calculation is a linear correction for errors in Γ as
discussed below.
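The steps of the derivation can be collected into one function and checked on a synthetic plant. This is a sketch under stated assumptions rather than the controller used in the experiments: the learned dynamics and their Jacobian with respect to u are passed in as callables `f_net(x, u)` and `jac_u(x, u)`, and every matrix and number below is made up. In the check, the learned term is linear in u and the unmodeled term is constant, so both approximations are exact and the error should follow Equation (3.24).

```python
import numpy as np

def hybrid_control(x, x_prev, u_prev, x_m, r, Phi, Gamma, Phi_m, Gamma_m, K,
                   f_net, jac_u):
    """One evaluation of the hybrid law, following (3.25)-(3.28)."""
    e = x_m - x                                                      # (3.23)
    hd = x - Phi @ x_prev - Gamma @ u_prev - f_net(x_prev, u_prev)   # (3.26)
    J = jac_u(x, u_prev)                                             # Jacobian in (3.27)
    rhs = ((Phi_m - Phi) @ x + Gamma_m @ r - K @ e - hd
           - f_net(x, u_prev) + J @ u_prev)
    return np.linalg.pinv(Gamma + J) @ rhs                           # (3.28)

# Synthetic check (made-up numbers).
Phi = np.array([[1.0, 0.1], [-0.2, 0.9]])
Gamma = np.eye(2) * 0.1
Phi_m = np.array([[0.9, 0.0], [0.0, 0.8]])
Gamma_m = np.eye(2) - Phi_m
K = -0.5 * Phi_m
B = np.array([[0.02, 0.0], [0.01, 0.03]])
f_net = lambda x, u: 0.05 * np.tanh(x) + B @ u   # linear in u, nonlinear in x
jac_u = lambda x, u: B
h_plus_d = np.array([0.02, -0.03])

def plant(x, u):
    # Plant model (3.21) with a constant unmodeled term.
    return Phi @ x + Gamma @ u + f_net(x, u) + h_plus_d

x_prev, u_prev = np.array([0.3, -0.1]), np.array([0.0, 0.0])
x = plant(x_prev, u_prev)
x_m, r = np.array([1.0, 0.5]), np.array([1.0, 0.5])
u = hybrid_control(x, x_prev, u_prev, x_m, r, Phi, Gamma, Phi_m, Gamma_m, K,
                   f_net, jac_u)
e_now = x_m - x
e_next = (Phi_m @ x_m + Gamma_m @ r) - plant(x, u)
```

Here the number of controls equals the number of states and Γ plus the Jacobian is invertible, so the pseudo-inverse reduces to an ordinary inverse; with fewer controls than states the constraint discussed for Equation (3.12) would also have to be considered.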

Beyond removing the delay associated with a purely adaptive controller, the hybrid
control system is able to reduce model uncertainty. This is accomplished by using partial
derivative information for the learned network term with respect to the control inputs. For
example, if there are errors in the coefficients of the assumed linear control weighting
matrix Γ, or the control actually affects the next state in a nonlinear fashion, the partial of
the learned dynamics with respect to the control represents the locally linearized unmodeled
effect of the controls. Assuming accurate derivative information can be obtained from the
network, the actual manner in which the controls impact the next state is thus the assumed
linear control weighting matrix corrected by this learned effect. The technique of using the
partial derivative information to improve the a priori design and ultimately reduce model
uncertainty represents a potentially major improvement over the TDC control law.
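The effect of correcting the control weighting matrix with the learned partial derivative can be illustrated with a single-input sketch. Everything below is hypothetical: the numbers are invented, the learned derivative is assumed to be exact, and the pseudo-inverse is written out only for the special case of a column vector.

```python
def pinv_column(g):
    """Moore-Penrose pseudo-inverse of a nonzero column vector g, returned as a row."""
    norm_sq = sum(gi * gi for gi in g)
    return [gi / norm_sq for gi in g]


def solve_control(g, rhs):
    """Least-squares scalar control u that best satisfies g * u = rhs."""
    return sum(p * r for p, r in zip(pinv_column(g), rhs))


gamma_assumed = [0.0, 0.1]   # assumed control weighting (hypothetical values)
gamma_true = [0.0, 0.4]      # true effect: assumed value plus an unmodeled term
dfnet_du = [0.0, 0.3]        # learned partial derivative, assumed exact here

desired_change = [0.0, 0.2]  # state change the control is asked to produce

# Control computed from the a priori model alone.
u_uncorrected = solve_control(gamma_assumed, desired_change)

# Control computed from the a priori model corrected by the learned derivative.
g_corrected = [a + b for a, b in zip(gamma_assumed, dfnet_du)]
u_corrected = solve_control(g_corrected, desired_change)

achieved_uncorrected = [g * u_uncorrected for g in gamma_true]
achieved_corrected = [g * u_corrected for g in gamma_true]
```

With these numbers the uncorrected control overshoots the requested state change by a factor of four, while the corrected control produces the requested change.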
4 EXPERIMENTS


A learning enhanced hybrid flight control system is demonstrated using the realistic
model of a high-performance, supersonic aircraft that is described in Section 4.2.
However, because the complexity of this aircraft model makes control system analysis
difficult, the hybrid controller is first applied to a relatively simple nonlinear aeroelastic
oscillator, described in Section 4.1. For this simple example, an exact truth model of the
nonlinear plant dynamics is known, and the mapping that is synthesized by the control
system can be compared to the known dynamics.

The objective of the experiments detailed in this section is to illustrate some of the
properties of the hybrid control system. In particular, the goal is to demonstrate the ability
of the hybrid system to improve the control of a nonlinear plant with model uncertainty and
unmodeled dynamics that are a function of state and control. Both the aeroelastic oscillator
and the high performance aircraft fall into this category. A secondary objective is to
illustrate the learning characteristics of spatially localized connectionist networks when
applied to control systems.

Section 4.1 and Section 4.2 each begin with a description of the plant model of
interest (i.e., the aeroelastic oscillator and the high performance aircraft) and the physical
motion that the model represents. This description is followed by a brief discussion of the
open-loop dynamics and other characteristics of that model. Next, the reference model,
along with the motivation for its selection, is presented. Issues in applying the hybrid
control law to each plant are also discussed. This development is followed by two
experiments for each plant that highlight the capabilities of the hybrid controller.
4.1 AEROELASTIC OSCILLATOR

4.1.1 Description

The aeroelastic oscillator models the motion of a square prism in a steady wind with an
external control force. If the aeroelastic oscillator is constrained to translational motion
normal to the incident wind, the dynamics resemble a classic mass-spring-dashpot system
with an additional aerodynamic lift force due to an effective angle-of-attack between the
wind and the prism (Parkinson & Smith (1963)). Figure 4.1 illustrates the aeroelastic
oscillator model, where V(t) is the incident wind, L(t) is the aerodynamic lift force, f(t) is
the control force, m is the mass of the square prism, r is the damping coefficient, and k is
the spring constant. The two state variables, position x(t) and velocity v(t), represent
motion normal to the incident wind.

Figure 4.1 Aeroelastic Oscillator Model

The aerodynamic lift force is a nonlinear function of the effective angle-of-attack of
the prism with respect to the incident wind. The effective angle-of-attack is due to the
motion of the prism as illustrated in Figure 4.2, where α denotes the effective
angle-of-attack and V_rel is the relative velocity.

Figure 4.2 Effective Angle-of-Attack

Although current aerodynamic theory does not offer an analytic solution for the flow
around a square prism, experimental data have been used to develop an approximation to the
coefficient of lift (C_L) as a seventh-order polynomial of the effective angle-of-attack
(Parkinson & Smith (1963)). Expressions for the coefficient of lift as well as for the
resulting lift force L are given by:

$$
C_L = A_1\left(\frac{\dot{x}}{V}\right) - A_3\left(\frac{\dot{x}}{V}\right)^3
    + A_5\left(\frac{\dot{x}}{V}\right)^5 - A_7\left(\frac{\dot{x}}{V}\right)^7 \tag{4.1}
$$

$$
L = \tfrac{1}{2}\rho V^2 h l\, C_L \tag{4.2}
$$

where the small angle approximation α = ẋ / V has been used, ρ is the air density, h is the
side length of the prism, and l is the axial length of the prism. The differential equation
governing the dynamics of the aeroelastic oscillator is:

$$
m\frac{d^2x}{dt^2} + r\frac{dx}{dt} + kx = L + f \tag{4.3}
$$


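Equation (4.3) can be nondimensionalized by dividing through by kh and making
the following change of variables:

$$
X = \frac{x}{h}; \quad n = \frac{\rho h^2 l}{2m}; \quad \omega = \sqrt{\frac{k}{m}}; \quad
U = \frac{V}{\omega h}; \quad \beta = \frac{r}{2m\omega}; \quad \tau = \omega t
$$

Applying this change of variables and substituting for the lift from Equation (4.2) yields:

$$
\frac{d^2X}{d\tau^2} + 2\beta\frac{dX}{d\tau} + X
  = nA_1U\frac{dX}{d\tau} - \frac{nA_3}{U}\left(\frac{dX}{d\tau}\right)^3
  + \frac{nA_5}{U^3}\left(\frac{dX}{d\tau}\right)^5
  - \frac{nA_7}{U^5}\left(\frac{dX}{d\tau}\right)^7 + f \tag{4.4}
$$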

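Equation (4.4) can be rewritten in a state space realization as:

$$
x_1 = X, \qquad x_2 = \frac{dX}{d\tau} \tag{4.5}
$$

$$
\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \end{bmatrix}
 = \begin{bmatrix} 0 & 1 \\ -1 & nA_1U - 2\beta \end{bmatrix}
   \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
 + \begin{bmatrix} 0 \\ 1 \end{bmatrix} f
 + \begin{bmatrix} 0 \\ -\dfrac{nA_3}{U}x_2^3 + \dfrac{nA_5}{U^3}x_2^5 - \dfrac{nA_7}{U^5}x_2^7 \end{bmatrix} \tag{4.6}
$$

where x_1 and x_2 are the nondimensionalized position and velocity states, respectively.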

Although the aeroelastic oscillator is a relatively simple second-order plant with a
single control variable (force f), it still presents difficulties to conventional control design
techniques due to the nonlinearities and uncertain parametric values (e.g., A_1, A_3, A_5, A_7)
for the lift force. For these reasons, the aeroelastic oscillator has been selected to illustrate
the properties of the hybrid controller.


4.1.2 Open-Loop Dynamics

The nonlinearities in the open-loop dynamics of the aeroelastic oscillator in
Equation (4.6) are a function of both mass velocity and incident wind velocity. For low
incident wind velocities, the focus of the state trajectories in the phase plane is stable and
the plant returns to the origin after exogenous disturbances. However, for higher wind
velocities, the system tends to oscillate in a stable limit cycle. If the wind velocity is further
increased, state trajectories in the phase plane are characterized by two stable limit cycles
separated by an unstable limit cycle. Since the aeroelastic oscillator either returns to the
origin or exhibits a stable limit cycle in the face of disturbances for any value of incident
wind, it is globally open-loop stable and a feedback loop is not required to provide nominal
(bounded input / bounded output) stability.


4.1.3 Reference Model


As discussed in Section 3.3, the hybrid control law is designed to cause the plant
state trajectory to follow a reference trajectory generated by a reference model. This
reference model has a significant influence on the performance of the closed-loop system,
since by definition it represents the desired trajectory of the controlled vehicle states. As a
result, if an unsatisfactory reference model is selected, the performance of the vehicle acting
under the hybrid control law will also be unsatisfactory. Furthermore, if the reference model
demands unrealistic state trajectories (e.g., reference trajectories that are chosen without
regard to the limitations of the actual plant dynamics), control saturation leading to
inadequate performance or even instabilities (in the general case) can occur. For these
reasons, the reference model must be selected to yield satisfactory dynamics within the
limitations of the vehicle or plant as required by specifications.

For the aeroelastic oscillator, the reference model was chosen to be the linear
closed-loop system that results from applying an optimal linear quadratic control design to
the aeroelastic oscillator dynamics linearized about the origin. The quadratic cost functional
weights states and control equally. Thus, the objective of the hybrid control law is to force
the true nonlinear model to behave identically to the linear reference model. Although not a
requirement, a linear reference model is often used to achieve specifications (objectives)
that have been stated in terms of natural frequency and damping ratio requirements. The
reference model for the aeroelastic oscillator has been designed with a natural frequency of
1.12 radians per second and a damping ratio of 0.76.
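The design procedure described above can be sketched for a generic second-order plant of the form x1' = x2, x2' = -x1 + a·x2 + u. The sketch below is an assumption-laden illustration, not the report's design: it uses a continuous-time linear quadratic regulator with identity state weighting and unit control weighting, and it takes a = 1.2 as the linearized velocity coefficient. Under those assumptions the damping ratio comes out near 0.76, while the natural frequency comes out near 1.19 radians per second, somewhat above the value quoted in the text, so the report's actual design presumably differs in some detail that is not stated here.

```python
import math


def lqr_second_order(a):
    """Continuous-time LQR gains for x1' = x2, x2' = -x1 + a*x2 + u, Q = I, R = 1.

    The algebraic Riccati equation is solved in closed form for this 2x2 case.
    Returns (k1, k2) for the state feedback u = -k1*x1 - k2*x2.
    """
    p12 = -1.0 + math.sqrt(2.0)                    # from the (1,1) Riccati entry
    p22 = a + math.sqrt(a * a + 2.0 * p12 + 1.0)   # from the (2,2) Riccati entry
    return p12, p22                                # K = B^T P = [p12, p22]


def closed_loop_characteristics(a, k1, k2):
    """Natural frequency and damping ratio of s^2 + (k2 - a) s + (1 + k1)."""
    omega_n = math.sqrt(1.0 + k1)
    zeta = (k2 - a) / (2.0 * omega_n)
    return omega_n, zeta


a = 1.2   # assumed linearized velocity coefficient
k1, k2 = lqr_second_order(a)
omega_n, zeta = closed_loop_characteristics(a, k1, k2)
```

The closed-loop linear system obtained this way would then serve as the reference model that generates the desired state trajectory.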

4.1.4 Application of Hybrid Controller

To aid in the design, simulation, and analysis of the hybrid learning system, a
custom-built software package developed at Draper Laboratory and called "NetSim" was
used. NetSim is a general-purpose simulation and design package that enables a variety of
connectionist learning control systems to be developed interactively (Alexander et al.
(1991)). Through a graphical interface, pre-compiled code modules are connected in a
block diagrammatic format to form the desired system. For dynamic systems, typical
modules include plants, transforms (e.g., signal modifiers such as delays or switches),
summing and gain objects, and even dynamic compensators. NetSim also contains design
tools that allow the user to create connectionist networks by graphically specifying the
network nodes and architecture. All of the code modules are automatically linked together
at run time, resulting in a complete system in which the outputs can be viewed on-line
while the simulation is in progress.

The closed-loop simulation for the aeroelastic oscillator system uses four main
modules as illustrated in the block diagram in Figure 4.3. This figure is a screen dump of
the actual simulation window. The main modules include the reference model, hybrid
controller, aeroelastic oscillator, and linear-Gaussian network. In addition to these main
components, supporting operators are needed to modify the signals passed between the
main modules to deliver the expected variables in the proper time sequence. The arrows
between modules represent exchanges of variables, and the number in the lower left corner
of each block dictates the order of execution at each time step. Modules called more than
once per time step are shown with multiple sequence numbers.

Each module in Figure 4.3 performs a specific function in modeling the closed-loop
dynamic system. The first module in the sequence is Random. Random outputs a
randomly generated commanded position at the current time k. This command is held
constant for a user specified amount of time. Once that time has elapsed, a new command
is issued. AO Reference outputs the desired (model) state trajectory of the aeroelastic
oscillator for the given command. The reference trajectory is generated using a discrete
version of the optimal linear design as discussed above.
Figure 4.3 Block Diagram of the Aeroelastic Oscillator System

The AO Switch module supplies the network with the state and control at the
appropriate time. It also sends a flag to the network to ensure that learning only occurs with
states and control at consistent times (e.g., learning occurs when the state, control, and
desired output are all at the same time instant). AO Network is a linear-Gaussian network
that serves as the learning system in the hybrid controller. The Multiplexor (shown with
sequence numbers 5 and 8) gathers the outputs from the network that are needed for
implementing the hybrid control law. Hybrid calculates the control signal based on the
hybrid control law developed in Section 3.3. This control signal is passed to the Aero
Oscillator module. The Aero Oscillator module contains the continuous nonlinear
equations-of-motion of the plant. These equations are integrated using either an Euler or
Runge-Kutta technique. The type and rate of integration, as well as plant
parameters and initial conditions, are selected by the user. Table 4.1 summarizes the output
of each module for one time step.
Table 4.1 Module Execution Summary

4.1.5 Aeroelastic Oscillator Experiment 1

In experiment 1, a selected reference trajectory was repeated continuously in order to learn
the state dependent, previously unknown dynamics f_net. The control objective was simply
the regulation of both states about the origin given an initial position of -1 and velocity of
0.5. By using the geometry and velocity parameters for a particular incident wind velocity
found in Parkinson & Smith (1963), the equations-of-motion used in experiment 1 become:

$$
\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \end{bmatrix}
 = \begin{bmatrix} 0 & 1 \\ -1 & 1.2 \end{bmatrix}
   \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
 + \begin{bmatrix} 0 \\ 1 \end{bmatrix} f
 + \begin{bmatrix} 0 \\ -26.1x_2^3 + 127.3x_2^5 - 158.9x_2^7 \end{bmatrix} \tag{4.7}
$$

The nonlinear terms in Equation (4.7) were not supplied to the control system and represent
the unknown dynamics in Equation (3.21).
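A minimal open-loop simulation of Equation (4.7) is sketched below. It assumes the coefficients as read here (a linear velocity coefficient of 1.2 and the cubic, fifth-order, and seventh-order velocity terms), uses a classical fourth-order Runge-Kutta step at 200 Hertz, and applies no control. The function names and the 60-unit simulation length are illustrative choices, not details taken from the report.

```python
def ao_derivatives(x1, x2, u=0.0):
    """Right-hand side of Equation (4.7) with the coefficients as read here."""
    nonlinear = -26.1 * x2**3 + 127.3 * x2**5 - 158.9 * x2**7
    return x2, -x1 + 1.2 * x2 + u + nonlinear


def rk4_step(x1, x2, u, dt):
    """One classical fourth-order Runge-Kutta step with the control held constant."""
    k1 = ao_derivatives(x1, x2, u)
    k2 = ao_derivatives(x1 + 0.5 * dt * k1[0], x2 + 0.5 * dt * k1[1], u)
    k3 = ao_derivatives(x1 + 0.5 * dt * k2[0], x2 + 0.5 * dt * k2[1], u)
    k4 = ao_derivatives(x1 + dt * k3[0], x2 + dt * k3[1], u)
    x1_next = x1 + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
    x2_next = x2 + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
    return x1_next, x2_next


def simulate_open_loop(x1, x2, seconds, rate_hz=200):
    """Integrate the uncontrolled oscillator and return the list of (x1, x2) samples."""
    dt = 1.0 / rate_hz
    history = [(x1, x2)]
    for _ in range(int(seconds * rate_hz)):
        x1, x2 = rk4_step(x1, x2, 0.0, dt)
        history.append((x1, x2))
    return history


# Initial condition of experiment 1: position -1, velocity 0.5.
trajectory = simulate_open_loop(-1.0, 0.5, seconds=60)
peak_position = max(abs(p) for p, _ in trajectory)
late_peak = max(abs(p) for p, _ in trajectory[-4000:])
```

With these coefficients the linear part has negative damping, so the uncontrolled trajectory does not settle at the origin; the high-order velocity terms keep it bounded, which is consistent with the limit-cycle behavior described in Section 4.1.2.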

Figures 4.4 and 4.5 illustrate the reference trajectories for position and velocity
(based on the linear model described above) for the selected initial conditions and
command. These reference trajectories represent the desired states at each time step, and
any deviation from the reference by the actual states can be considered an error. The
position and velocity trajectories of the nonlinear aeroelastic oscillator controlled by TDC
alone (TDC Position / Velocity) are also shown in Figures 4.4 and 4.5 and are almost
indistinguishable from the reference. In this case, the TDC controlled trajectories were
produced by integrating the aeroelastic oscillator equations-of-motion at 200 Hertz and
generating a control signal at that same rate. Moreover, there was no noise in the observed
state and control values used by TDC. Given these facts, it is not surprising that the
TDC controller does extremely well in generating a control law that drives the plant along
the reference trajectory. Indeed, because of the extremely small time step, the unknown
dynamics observed at the previous time provide an accurate estimate of the unknown
dynamics at the current time that is required by the TDC control law. Also plotted in
Figures 4.4 and 4.5 are the trajectories generated using the constant gains of the linear
controller used to form the linear reference model and applied to the actual nonlinear
aeroelastic oscillator (labeled Linear Position / Velocity). Errors between trajectories under
this linear control and the reference trajectory are due primarily to the nonlinear
aerodynamic lift force. These plots show the degree of performance improvement (relative
to a linear feedback law) that is possible with an adaptive controller operating under ideal
conditions.


Figure 4.4 Position trajectories for the reference model, linear control law, and TDC at
200 Hertz controller rate.

Figure 4.5 Velocity trajectories for the reference model, linear control law, and TDC at
200 Hertz controller rate.

Since the sensing and computational requirements associated with generating state
information for the aeroelastic oscillator at the 200 Hertz integration rate may be unrealistic,
the controller is slowed to calculate the control signal at a more moderate rate. For this
experiment, the control was generated at 10 Hertz. In order to produce unknown dynamics
that are a function of control as well as state, an unknown external force equal to three
times the control force was added to the unknown dynamics. In other words, the control
term in Equation (4.7) was modified from the assumed known value

$$
\begin{bmatrix} 0 \\ 1 \end{bmatrix} f
$$

to the applied value

$$
\begin{bmatrix} 0 \\ 1 \end{bmatrix} f + \begin{bmatrix} 0 \\ 3 \end{bmatrix} f
$$

where the added term is not known by the controller. A relatively large force error was
used to highlight the ability of the hybrid control system to reduce large uncertainties.

Consistent with the hybrid control law developed in Section 3.3, the learning
system used a spatially localized network with 32 linear-Gaussian nodes. For training the
network, a learning rate of 1 was used with the spatial decay of all the nodes fixed at 2.
These values were selected based on the known mapping of the nonlinearities, the size of
the input space (i.e., range of all possible position, velocity, and control combinations),
and to a certain extent on trial and error. Initial values for the slopes and biases were set to
zero while the Gaussian centers were placed randomly in the unit cube formed by scaling
the state and control inputs. Figures 4.6 and 4.7 illustrate the hybrid controlled states for
the first learning trial compared to the TDC controlled states and reference model. Since the
slopes and biases of the learning system are initialized to zero, the learning system does not
impact the states at start-up, and all of the unknown dynamics are incorporated into the
TDC adaptive component. However, after a short time (within the first trial), the learning
system begins to build a mapping of the unknown dynamics. This mapping is used to
eliminate the delay associated with the unknown dynamics estimate in TDC and to improve
the estimate of the local linearized behavior (i.e., using the derivative information as
discussed in Section 3.3 to reduce model uncertainty). These features can be directly
related to the improved performance seen in the state tracking of the reference trajectory.
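The initialization described above can be sketched with a small spatially localized network. The functional form used here (a normalized Gaussian blend of local linear maps, with the spatial decay acting as a multiplier in the exponent) is an assumption made for illustration; the report's own definition of the linear-Gaussian network appears in an earlier section and may differ in detail. The class and method names are hypothetical.

```python
import math
import random


class LinearGaussianNetwork:
    """Spatially localized network: Gaussian-weighted blend of local linear maps.

    Assumed form (not taken from the report):
        y(z)       = sum_i gamma_i(z) * (slope_i . (z - center_i) + bias_i)
        gamma_i(z) = g_i(z) / sum_j g_j(z)
        g_i(z)     = exp(-decay * |z - center_i|^2)
    """

    def __init__(self, n_nodes, n_inputs, decay, seed=0):
        rng = random.Random(seed)
        # Centers are placed randomly in the unit cube; slopes and biases start at zero.
        self.centers = [[rng.random() for _ in range(n_inputs)]
                        for _ in range(n_nodes)]
        self.slopes = [[0.0] * n_inputs for _ in range(n_nodes)]
        self.biases = [0.0] * n_nodes
        self.decay = decay

    def blend_weights(self, z):
        """Normalized Gaussian weight of every node at input z."""
        g = [math.exp(-self.decay * sum((zj - cj) ** 2 for zj, cj in zip(z, c)))
             for c in self.centers]
        total = sum(g)
        return [gi / total for gi in g]

    def output(self, z):
        """Blend the local linear maps with the normalized Gaussian weights."""
        gammas = self.blend_weights(z)
        return sum(
            gam * (sum(s * (zj - cj) for s, zj, cj in zip(slope, z, c)) + b)
            for gam, slope, c, b in zip(gammas, self.slopes,
                                        self.centers, self.biases))


net = LinearGaussianNetwork(n_nodes=32, n_inputs=3, decay=2.0)
z = [0.2, 0.7, 0.5]   # scaled (position, velocity, control), hypothetical point
initial_output = net.output(z)
weights = net.blend_weights(z)
```

Because every slope and bias starts at zero, the network output is zero everywhere before learning, which matches the observation that the learning system does not affect the states at start-up.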
Figure 4.6 Position trajectories for the reference model, TDC, and hybrid control law at
10 Hertz controller rate, first learning trial.

Figure 4.7 Velocity trajectories for the reference model, TDC, and hybrid control law at
10 Hertz controller rate, first learning trial.

After the trajectory is repeated 10 times, the learning system has built a mapping of
the previously unknown dynamics as a function of the state and control along that
trajectory. Figure 4.8 compares the estimate of the unknown dynamics used by TDC to
that of the hybrid controller generated from the learned mapping (after 10 trials). Since the
mapping used to generate these points represents a static function, the unknown dynamics
can simply be looked up as a function of the current state and control. This can be
contrasted with TDC, which uses an estimate of the unknown dynamics based on the state
and control at the previous time.


Figure 4.8 Unknown dynamics estimate from network and TDC after 10 trials.

As discussed in Section 3.3, the hybrid controller uses the output of the network
as well as the derivative of the network output with respect to the control (∂f_net/∂u) to
formulate the control law. This derivative information provides local improvements to the
linear control weighting vector, Γ. Since the truth model for the aeroelastic oscillator is
known, it is possible to analyze the accuracy of the derivative information. For example,
the partial of the unknown dynamics with respect to the control force is simply, in
continuous time, the constant three (due to the added external control force). When
converted to discrete time, this value is 0.3182. After 10 trials, the network's mean value of
the partial derivative is 0.2285. Although it has not yet learned the correct value, it
nonetheless provides some improvement to the control weighting matrix.
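The conversion of the continuous-time value of three to its discrete-time equivalent can be checked numerically. The sketch below assumes a zero-order hold on the control over the 0.1-second control interval and takes the linear part of the equations-of-motion to have a velocity coefficient of 1.2; both are assumptions of this illustration, and the helper names are hypothetical. Under those assumptions the velocity-row entry of the discretized input matrix comes out at about 0.318.

```python
def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def zoh_input_matrix(A, B, T, terms=20):
    """Zero-order-hold input matrix: integral of exp(A s) B ds over [0, T].

    Evaluated with the power series  sum_k A^k B T^(k+1) / (k+1)!.
    """
    rows, cols = len(B), len(B[0])
    term = [row[:] for row in B]   # holds A^k B, starting with k = 0
    coeff = T                      # holds T^(k+1) / (k+1)!
    gamma = [[0.0] * cols for _ in range(rows)]
    for k in range(terms):
        for i in range(rows):
            for j in range(cols):
                gamma[i][j] += coeff * term[i][j]
        term = mat_mul(A, term)
        coeff *= T / (k + 2)
    return gamma


A = [[0.0, 1.0], [-1.0, 1.2]]   # assumed linear part of the equations-of-motion
B_added = [[0.0], [3.0]]        # added force: three times the control force
T = 0.1                         # 10 Hertz control interval

gamma_added = zoh_input_matrix(A, B_added, T)
discrete_partial = gamma_added[1][0]
```

The result is slightly larger than the naive value 3 × 0.1 = 0.3 because the negative damping in the linear part amplifies the effect of the force within each control interval.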
Figures 4.9 and 4.10 illustrate the state trajectories controlled by the TDC controller
and the hybrid controller after 10 trials. Clearly, the hybrid controller uses experientially
gained knowledge to improve the tracking of the reference states.


Figure 4.9 Position trajectories for the reference model, TDC, and hybrid control law at
10 Hertz controller rate, 10 trials.

Figure 4.10 Velocity trajectories for the reference model, TDC, and hybrid control law at
10 Hertz controller rate, 10 trials.

This experiment shows that the hybrid controller has the ability to improve the
controlled performance of the aeroelastic oscillator when a specific trajectory is repeated
numerous times. This improved performance is realized by exploiting a learned functional
mapping of the previously unknown model dynamics to improve the control law. The next
experiment illustrates the ability to synthesize a mapping over a much larger input space,
using randomly generated state trajectories.

4.1.6  Aeroelastic  Oscillator  Experiment  2 

In experiment 2, the desired trajectory is selected in a random fashion in order to
map the unknown dynamics over a much larger region of the state space than the single
trajectory in experiment 1. By commanding a random position between -1 and 1, a large
portion of the state space along with the associated controls is visited and subsequently
mapped. As in experiment 1, the aeroelastic oscillator was integrated at 200 Hertz and the
control signal issued once every 20 integrations (10 Hertz). For this experiment, a spatially
localized learning system with 99 linear-Gaussian nodes was used. The spatial decay for
each node was fixed to 1 and the initialization was the same as for experiment 1. The
number of nodes, spatial decay, and other parameters were again selected based on the
expected nonlinearities, size of the input space, and trial and error.

Figure 4.11(a) shows the mapping synthesized by the learning system as a function
of velocity and control. Learning was based on following the randomly generated
reference trajectory for 60 seconds (10 trials). This mapping is compared to the nonlinear
terms and extraneous control of the aeroelastic oscillator truth model shown in Figure
4.11(b). Comparing the two plots, the slope in the control direction (force) for the network
mapping is nearly constant with a mean of 0.3120 and standard deviation of 0.0264,
whereas the actual slope is 0.3182 (in the discrete time model). Moreover, the mappings in
the velocity direction appear very similar. Hence, the network has synthesized the
previously unknown dynamics of the system. (Note: the current version of the software
does not allow a direct error surface plot to be generated.)
Figure 4.11 (a) Network Mapping of Unknown Dynamics (b) Actual Unknown
Dynamics.

Figures 4.12 and 4.13 illustrate the position and velocity trajectories for the TDC
and hybrid controlled states after 30 seconds of simulation. As predicted by the relatively
accurate mapping of the unknown dynamics, the position and velocity show improved
performance for the hybrid controlled aeroelastic oscillator over that of TDC.


Figure 4.12 Position trajectories for the reference model, TDC, and hybrid control law at
10 Hertz controller rate, 10 trials.

Figure 4.13 Velocity trajectories for the reference model, TDC, and hybrid control law at
10 Hertz controller rate, 10 trials.
4.2 HIGH PERFORMANCE AIRCRAFT MODEL

4.2.1  Aircraft  Description 

The high performance aircraft model that is used to illustrate the concept of learning
enhanced flight control was developed by NASA to provide the aeronautical community
with a common focus for research in flight control theory and design. This model is also
being used to serve as a basis for the 1992 AIAA Control Design Challenge (Duke (1992)).
A complete description of this generic, high performance, state-of-the-art aircraft model is
found in Brumbaugh (1991). The following summarizes the major characteristics of the
aircraft model as well as its critical components.

The NASA model is the basis for the simulation of a high-performance, supersonic
vehicle representative of modern fighter and attack aircraft. This model supports practically
all missions in nonterminal flight phases. These missions include flight phases that are
normally accomplished using gradual maneuvers such as climb, cruise, or loiter as well as
phases that require rapid maneuvering, precision tracking, or precise flight-path control
(e.g., air-to-air combat, weapon delivery, or terrain following). The aircraft model
includes full-envelope, nonlinear aerodynamics in addition to a full-envelope, nonlinear
thrust model. An illustration of the basic configuration of the aircraft is shown below in
Figure 4.14. Significant features of this aircraft configuration include a single vertical tail
with rudder surface, a horizontal stabilator capable of symmetric and differential
movement, and conventional trailing edge ailerons.
Figure 4.14 NASA High Performance Aircraft Model

The basic geometry and mass properties of the aircraft are summarized below in
Table 4.2.


Table 4.2 Basic Aircraft Geometry and Mass Properties

Aircraft Geometry and Mass Properties
Wing Area         608.00 ft^2
Wing Span          42.80 ft
Mean Chord         15.95 ft
Weight          45000.00 lb

To aid in the design and development of a competent flight control law, the model
can be easily broken into separate components, each performing a specific function. The
major components of the aircraft model are as follows: aerodynamics, propulsion, actuator
dynamics, and equations-of-motion. Also included with the model are the standard
atmosphere component, an environmental model, and the integration component that is
used to simulate the aircraft in software. Of course, one element that is not in this list is the
flight control law, which is to be determined by the designer. The function of each of the
major components, as well as a brief discussion of its origins, is presented in the
following paragraphs.

As alluded to previously, the NASA aircraft simulation contains a nonlinear,
full-envelope aerodynamic model. The primary function of this component is the calculation
of aerodynamic forces and moments generated by the aircraft throughout its flight regime.
In general, the aerodynamic forces and moments are complicated, nonlinear functions of
many variables. The approach taken by the NASA model in calculating the complex force
values is based on modeling the force terms as the product of dynamic pressure, a reference
area (wing area), and an appropriate dimensionless aerodynamic coefficient. Similarly, the
aerodynamic moment term is modeled as the product of dynamic pressure, a reference
area, a dimensionless aerodynamic coefficient, and a reference length (mean chord). The
aerodynamic coefficients are primarily functions of Mach number, angle-of-attack, and
sideslip angle. The NASA aircraft model acquires coefficient values from multidimensional
data tables or from direct calculation. The coefficients contained in the tabular data have
been generated through a combination of wind-tunnel tests and computer programs that
numerically integrate the theoretical aerodynamic pressure over the surface of the aircraft.
For the tabular data, linear interpolation is employed to obtain intermediate values.
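The force and moment build-up described above can be sketched in a few lines. The wing area and mean chord are the values listed in Table 4.2; everything else is hypothetical, including the one-dimensional coefficient table (the model itself uses multidimensional tables), the clamping behavior outside the table range, and the dynamic pressure value.

```python
WING_AREA_FT2 = 608.0    # reference area from Table 4.2
MEAN_CHORD_FT = 15.95    # reference length from Table 4.2


def interp_linear(x, xs, ys):
    """Piecewise-linear table lookup; clamps to the end values outside the table."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])


def aero_force(qbar, coeff):
    """Force = dynamic pressure * reference area * dimensionless coefficient."""
    return qbar * WING_AREA_FT2 * coeff


def aero_moment(qbar, coeff):
    """Moment = dynamic pressure * reference area * reference length * coefficient."""
    return qbar * WING_AREA_FT2 * MEAN_CHORD_FT * coeff


# Hypothetical lift-coefficient table versus angle-of-attack in degrees.
alpha_table_deg = [0.0, 5.0, 10.0]
cl_table = [0.0, 0.4, 0.8]

cl = interp_linear(2.5, alpha_table_deg, cl_table)
lift_lb = aero_force(qbar=100.0, coeff=cl)   # qbar in lb/ft^2, hypothetical
```

A multidimensional table would apply the same interpolation along each axis in turn (Mach number, angle-of-attack, sideslip angle).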

Vehicle thrust is generated by the propulsion model. Twin afterburning turbofan
engines, each capable of generating 32,000 pounds of thrust, deliver power to the aircraft.
Each engine thrust vector acts along the aircraft x-body axis at a point 10 feet behind the
center of gravity and 4 feet laterally from the centerline. Engine dynamics are modeled by
separating the powerplant into two separate sections. The first section, the engine core, is
modeled as a first-order, closed-loop system that outputs thrust for a given throttle input.
Moreover, rate limits that simulate spool-up effects and a gain scheduler that models
changes in performance due to Mach number and altitude changes have been added to
provide realism to the closed-loop system. Gains are obtained from tabular data and a
linear interpolation routine based on Mach number and altitude. A second section, the
afterburner, is modeled with similar first-order dynamics but has the added features of a
rate limiter and sequencing logic to model fuel pump and pressure regulator effects.
Together, these components comprise the full-envelope, nonlinear thrust model.
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Atinospheric  parameters  required  by  the  aircraft  simulation  are  computed  by  the 
standard  atmosphere  model.  For  a  given  altitude,  values  for  acceleration  due  to  gravity, 
speed  of  sound,  temperature,  and  other  essential  parameters  are  geneiated  from  tables 
based  on  the  IJ.S.  Standaj;d  Atmosphere  of  1962.  Linear  interpolation  is  used  between 
elements  of  the  table. 

The  actuator  dynamics  model  is  a  first-order  system  that  outputs  surface  position 
for  a  given  surface  command.  Furthermore,  rate  and  position  limits  are  included  in  the 
system.  All  actuators  are  considered  to  be  identical. 
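A first-order actuator with rate and position limits can be sketched as follows. The explicit Euler update, the function name, and every numerical value are assumptions made for illustration; the report does not give the actuator time constant or limits in this section.

```python
def actuator_step(pos, cmd, dt, tau, rate_limit, pos_min, pos_max):
    """Advance a first-order actuator by one time step (explicit Euler).

    pos        current surface position
    cmd        commanded surface position
    tau        first-order time constant (hypothetical value supplied by caller)
    rate_limit maximum magnitude of surface rate
    """
    rate = (cmd - pos) / tau                          # first-order lag toward command
    rate = max(-rate_limit, min(rate_limit, rate))    # rate limit
    pos = pos + rate * dt
    return max(pos_min, min(pos_max, pos))            # position limit


# Hypothetical usage: a large step command saturates the rate limit.
pos_after_one_step = actuator_step(0.0, 100.0, 0.02, 0.05, 10.0, -20.0, 20.0)
```

Because the limits act inside the loop, the surface position actually applied to the aircraft can differ from the commanded value whenever either limit is active.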

The  dynamics  of  the  aircraft  are  simulated  using  the  equations-of-motion  module. 
The  nonlinear  equations-of-motion  are  derived  from  the  general  six-degree-of-freedom 
relations  for  a  rigid  aircraft.  Beyond  the  rigid  body  assumption,  it  is  also  assumed  that  the 
vehicle  is  traveling  with  nonzero  forward  motion  in  an  atmosphere  that  is  stationary  with 
respect to an Earth-fixed reference frame. The nonzero forward motion assumption
mandates that only nonterminal flight phases be simulated by this model. Since each
degree of freedom requires two state variables (the basic variable and its rate), a total of
twelve first-order differential equations are required to completely describe the motion of
the aircraft. Table 4.3 lists each state variable and its symbol. Note that if speed is
assumed to be relatively constant, then angle-of-attack and sideslip angle may be
substituted for the z and y body axis velocity vector projections, respectively. A detailed
derivation of these equations-of-motion can be found in Etkin (1982) or Roskam (1979).

The state variables are propagated in time via the integration module. This module
uses a second-order Runge-Kutta midpoint algorithm to arrive at a new state based on the
state and control at the previous time. Running at 50 Hertz, this integration technique has
been found to provide a balanced tradeoff between numerical stability and processing
speed.
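One step of the second-order Runge-Kutta midpoint scheme can be sketched as below. The function name and the scalar test system are assumptions for illustration, and the control is assumed to be held constant over the step.

```python
def rk2_midpoint_step(f, x, u, dt):
    """One midpoint step: x_next = x + dt * f(x + (dt / 2) * f(x, u), u).

    f(x, u) returns the state derivative as a list; u is held constant
    over the step.
    """
    k1 = f(x, u)
    x_mid = [xi + 0.5 * dt * ki for xi, ki in zip(x, k1)]
    k2 = f(x_mid, u)
    return [xi + dt * ki for xi, ki in zip(x, k2)]


# Hypothetical test system x_dot = -x, integrated at 50 Hertz (dt = 0.02 s).
x_next = rk2_midpoint_step(lambda x, u: [-x[0]], [1.0], None, 0.02)
```

For this test system the step returns approximately 0.9802, close to the exact value exp(-0.02), at the cost of two derivative evaluations per step.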
Table 4.3 The twelve aircraft state variables and symbols

State Variable        Symbol
Displacement North    x
Displacement East     y
Altitude              h
Velocity              u
Angle of Attack       α
Side Slip Angle       β
Pitch Angle           θ
Roll Angle            φ
Yaw Angle             ψ
Roll Rate             p
Pitch Rate            q
Yaw Rate              r

An auxiliary component of the aircraft model that is not critical to the simulation but
is invaluable to the control law designer is the observation model. The function of the
observation model is to output a large class of aircraft measurements. States, state
derivatives, accelerations, airdata parameters, force parameters, and a multitude of other
important data are furnished for observation. Of these parameters, state information as well
as vehicle body axis rates make up the set of parameters that have been traditionally used
for feedback flight control.

By linking the previously described modules, a realistic, highly complex, nonlinear
aircraft model that poses formidable challenges to the flight control designer is assembled.
The high performance aircraft computer model received from NASA was written in the
FORTRAN programming language. In order to produce a model compatible with the
NetSim simulation and design package discussed in Section 4.1.4, this FORTRAN version
was translated into the C programming language by the author.

4.2.2  General  Aircraft  Characteristics 

The  flight  envelope  of  the  NASA  aircraft  model  is  characteristic  of  a  high 
performance  fighter  aircraft  (Brumbaugh  (1991)).  Figure  4.15  below  illustrates 
approximate bounds of aircraft operation in terms of altitude and Mach number.


[Plot: NASA Aircraft Flight Envelope]

Figure 4.15 1-g Aircraft Flight Envelope.

To examine the nonlinear dynamics of a complex aircraft model, the equations-of-
motion are frequently linearized about various operating conditions. By linearizing the
dynamics at a sufficient number of operating conditions within the envelope, an improved
overall picture of the actual nonlinear dynamics can be gained. Generally, operating
conditions near the boundary of the envelope, as well as a few centrally located points, are
selected. The most common technique for obtaining linearized dynamics is to invoke the
small perturbation theory based on a Taylor series expansion. This theory uses
infinitesimal perturbations from an equilibrium or trimmed steady-state reference condition
to predict aircraft response to perturbations that are not infinitesimal. A trim condition is
classically defined as a constant velocity and altitude state with control surfaces and
throttles set to maintain this condition. If it is assumed that all perturbations and their
derivatives are small, the quadratic and higher-order products of the perturbations will be
negligible compared to the first-order quantities. In other words, a linear model is obtained
by deriving relations of small deviations of all state and control variables about a steady-
state equilibrium condition and retaining linear terms while ignoring quadratic and higher-
order terms. A detailed version of the following short derivation of this theory can be
found in Athans (1990).

Let $x(t)$ and $u(t)$ represent state and control variables, respectively, with

$$x(t) \in \mathbb{R}^{n} \tag{4.8}$$

$$u(t) \in \mathbb{R}^{m} \tag{4.9}$$

The nonlinear state dynamics in continuous time are given by

$$\dot{x}(t) = f\{x(t), u(t)\} \tag{4.10}$$

The reference state and control values representing an equilibrium condition (e.g.,
$\dot{x}(t) = 0$) for the nonlinear equation are denoted by a subscript zero:

$$0 = f\{x_0, u_0\} \tag{4.11}$$

Small perturbations about the equilibrium condition are denoted with a lower case delta:

$$x(t) = x_0 + \delta x(t) \tag{4.12}$$

$$\dot{x}(t) = \delta\dot{x}(t) \tag{4.13}$$

$$u(t) = u_0 + \delta u(t) \tag{4.14}$$

Expanding the state dynamics in a Taylor series about the equilibrium condition and solving
for $\delta\dot{x}(t)$ while retaining linear terms and disregarding higher-order terms yields the
following state perturbation dynamics:

$$\delta\dot{x}(t) = A_0\,\delta x(t) + B_0\,\delta u(t) \tag{4.15}$$

where

$$(A_0)_{ij} = \left.\frac{\partial f_i}{\partial x_j(t)}\right|_{x_0,\,u_0} \tag{4.16}$$

$$(B_0)_{ij} = \left.\frac{\partial f_i}{\partial u_j(t)}\right|_{x_0,\,u_0} \tag{4.17}$$


$A_0$ and $B_0$ are the Jacobian matrices of the Taylor series expansion of $f\{x(t),u(t)\}$ centered
about $x_0$ and $u_0$. Although the Jacobian matrices can occasionally be found in closed form
for  relatively  simple  systems,  more  complex  systems  often  require  numerical 
differentiation.  For  this  reason,  numerical  differentiation  is  used  for  the  aircraft  model  to 
calculate  the  Jacobian  matrices. 
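Numerical differentiation of the dynamics can be sketched with central differences, as below. The report does not state which differencing scheme or step size was used, so the central-difference formula, the step `eps`, the function names, and the pendulum test system are all assumptions for illustration.

```python
import math


def numerical_jacobians(f, x0, u0, eps=1e-6):
    """Central-difference estimates of A0 = df/dx and B0 = df/du at (x0, u0)."""
    n, m = len(x0), len(u0)

    def partial(wrt_state, j):
        # Derivative of every component of f with respect to one state or control.
        x_hi, x_lo, u_hi, u_lo = list(x0), list(x0), list(u0), list(u0)
        if wrt_state:
            x_hi[j] += eps
            x_lo[j] -= eps
        else:
            u_hi[j] += eps
            u_lo[j] -= eps
        f_hi, f_lo = f(x_hi, u_hi), f(x_lo, u_lo)
        return [(a - b) / (2.0 * eps) for a, b in zip(f_hi, f_lo)]

    a_cols = [partial(True, j) for j in range(n)]
    b_cols = [partial(False, j) for j in range(m)]
    # Each entry above is one column of a Jacobian; transpose to row-major (i, j).
    A0 = [[a_cols[j][i] for j in range(n)] for i in range(n)]
    B0 = [[b_cols[j][i] for j in range(m)] for i in range(n)]
    return A0, B0


def pendulum(x, u):
    """Hypothetical test system: x = [angle, rate], u = [torque]."""
    return [x[1], -math.sin(x[0]) + u[0]]


A0, B0 = numerical_jacobians(pendulum, [0.0, 0.0], [0.0])
```

For the pendulum at the origin the estimates are close to A0 = [[0, 1], [-1, 0]] and B0 = [[0], [1]], which matches the closed-form Jacobians. For the aircraft model, `f` would be the full equations-of-motion evaluated at a trim condition.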

Using the small perturbation theory to linearize the equations-of-motion about an
equilibrium condition can provide insight into the local behavior of the nonlinear aircraft
dynamics in terms of stability, transient responses, and other system characteristics.
However, this theory is not without its limitations. Large numbers of linear models must
be computed to characterize the dynamics in highly nonlinear regions of the flight envelope.
Moreover, the small perturbation theory is ill-suited to handle phases of flight where large
deviations from the nominal trim condition are encountered (e.g., high angle-of-attack flight
or spinning maneuvers).

Linearizing the equations-of-motion of the NASA aircraft has revealed that the
longitudinal dynamics are only lightly coupled with the lateral dynamics at the majority of
flight conditions. Moreover, control of the longitudinal or pitching motion is dominated by
symmetric movement of the horizontal tail and engine thrust, whereas the rolling and
yawing motions associated with the lateral dynamics are most heavily influenced by the
ailerons and differential movements of the horizontal tail. For this reason, the aircraft flight
control design problem can be separated into two distinct problems, each less complex than
the whole. The existence of the lightly coupled modes and the ability to decompose the
control system design is common to all but the most unconventional aircraft.

The linearized longitudinal dynamics of the aircraft, decoupled from the lateral
dynamics, can be described by a total of five coupled linear, time-invariant differential
equations that are a function of pitch rate, velocity, angle-of-attack, pitch angle, and
altitude. If the linearized equation for the dynamics of the total thrust in the longitudinal
direction is added to this set of variables, the state of the aircraft for longitudinal motion is
as follows (where T is total thrust):

$$x = [\theta \;\; u \;\; \alpha \;\; q \;\; h \;\; T] \tag{4.18}$$

If the dynamics of the inertial altitude and thrust are temporarily neglected, the four
remaining differential equations define the traditional natural modes associated with aircraft
pitching motion, namely the short period and phugoid modes. The short period mode is
characterized by a highly damped, high frequency oscillation. The short period oscillations
represent changes in angle-of-attack and pitch angle with near constant trim speed. In
contrast, the phugoid mode exhibits very lightly damped, low frequency oscillations when
excited. Under the influence of the phugoid mode, the angle-of-attack remains essentially
constant while the speed and pitch angle experience changes. This motion represents a
continual exchange of kinetic and potential energy of a slowly rising and falling airplane.
Table 4.4 contains the natural frequencies ($\omega_n$) as well as the damping ratios ($\zeta$) for the
open-loop longitudinal modes (sp = short period, ph = phugoid) of the NASA aircraft
model at four equilibrium points (trim conditions) near the subsonic boundary of the flight
envelope. Trim condition 5, which is not on the boundary, is included since it will be used
as the initial condition for experiments described in Sections 4.2.5 and 4.2.6.
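The natural frequency and damping ratio of an oscillatory mode follow directly from the corresponding eigenvalue of the linearized state matrix. The sketch below uses the standard second-order relations, with the eigenvalue written as -ζω_n ± jω_n√(1-ζ²). The function names are hypothetical, and the example eigenvalue is constructed from the short period values listed for trim condition 5 rather than computed from the aircraft model.

```python
import math


def eigenvalue_from_mode(wn, zeta):
    """Upper eigenvalue of an underdamped second-order mode (0 <= zeta < 1)."""
    return complex(-zeta * wn, wn * math.sqrt(1.0 - zeta ** 2))


def mode_from_eigenvalue(lam):
    """Return (natural frequency, damping ratio) for one eigenvalue of a mode."""
    wn = abs(lam)              # distance of the eigenvalue from the origin
    zeta = -lam.real / wn      # cosine of the angle from the negative real axis
    return wn, zeta


# Short period mode at trim condition 5: wn = 2.77, zeta = 0.52.
lam_sp = eigenvalue_from_mode(2.77, 0.52)
wn_sp, zeta_sp = mode_from_eigenvalue(lam_sp)
```

In practice the eigenvalues would come from the four-state longitudinal matrix obtained by linearization, with the faster complex pair identified as the short period and the slower pair as the phugoid.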
Table 4.4 High performance aircraft longitudinal modes at various altitude and Mach
number trim conditions.

Natural Frequency and Damping Ratio: Longitudinal Modes

Trim Condition   Altitude (ft)   Mach   ω_n,sp (rad/s)   ζ_sp   ω_n,ph (rad/s)   ζ_ph
1                5000            0.31   [illegible]      0.58   0.11             0.04
2                5000            0.90   4.68             0.28   **               **
3                35000           0.68   1.92             0.32   0.08             0.10
4                35000           0.90   2.11             0.21   0.02             0.12
5                9800            0.60   2.77             0.52   0.08             0.07

** At this trim condition, the aircraft does not exhibit a phugoid motion.


For purposes of comparison, the required values of natural frequency and damping ratio
for a high maneuverability aircraft in nonterminal flight phases can be found in the military
specification MIL-F-8785C (1980). This specification requires the phugoid mode
to have a damping ratio greater than 0.04 and the short period damping ratio to be between
0.35 and 1.30. Moreover, the short period must have a natural frequency approximately
bounded by 1 and 10 radians per second, depending on load factor and angle-of-attack.
As Table 4.4 shows, the NASA aircraft fails to meet the requirements for
longitudinal motion in some areas of the flight envelope. However, through the use of a
control system, the aircraft modes can be modified to meet the military specifications. For
the hybrid control law, this is accomplished by selecting a reference model that meets these
specifications.
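The comparison against the specification can be made concrete with a small check. The bounds below are the ones quoted in the preceding paragraph; the fixed 1 to 10 radians per second frequency window is only the approximate bound (the actual limit depends on load factor and angle-of-attack), and treating a missing phugoid mode as passing the phugoid criterion is an assumption. The function name is hypothetical, and the mode values for trim conditions 2 through 5 are taken from Table 4.4.

```python
def meets_longitudinal_spec(wn_sp, zeta_sp, zeta_ph=None):
    """Check one trim condition against the quoted longitudinal requirements."""
    sp_damping_ok = 0.35 <= zeta_sp <= 1.30
    sp_frequency_ok = 1.0 <= wn_sp <= 10.0          # approximate bound only
    phugoid_ok = zeta_ph is None or zeta_ph > 0.04  # None: no phugoid mode present
    return sp_damping_ok and sp_frequency_ok and phugoid_ok


# (short period frequency, short period damping, phugoid damping) from Table 4.4
trim_modes = {
    2: (4.68, 0.28, None),
    3: (1.92, 0.32, 0.10),
    4: (2.11, 0.21, 0.12),
    5: (2.77, 0.52, 0.07),
}
results = {trim: meets_longitudinal_spec(*modes) for trim, modes in trim_modes.items()}
```

Under these assumptions, trim conditions 2, 3, and 4 fail because the short period damping ratio is below 0.35, while trim condition 5 passes.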

4.2.3 Aircraft Reference Model


As discussed in Section 3.3, the reference model generates the desired state
trajectory for the hybrid controlled aircraft states. During the process of selecting a
reference model, close attention was paid to ensuring that following the reference
trajectories did not require unrealistic control actions. Since the rate and position of the
horizontal stabilator are limited, unrealistic demands on control can translate into either rate
or position saturation (i.e., an inability to exercise the control that has been calculated by
the control law). Control saturation leads to inadequate performance (i.e., failure to meet
military or other specifications) and possibly to instabilities.

The reference model for the high performance aircraft was chosen to be the linear
closed-loop system that results from applying an optimal linear control design to the open-
loop dynamics linearized about a selected trim condition. For the experiments in Sections
4.2.5 and 4.2.6, the linearized dynamics at trim condition 5 (see Table 4.4) were used.
The state and control weights for the quadratic cost function used by the optimal control
law were initially selected using guidelines suggested by Bryson & Ho (1975) and
Kwakernaak & Sivan (1972). Trial and error (based on simulations of the linear dynamics)
was used to arrive at the final cost function. The natural frequencies and damping ratios for
the modes of the closed-loop reference system are listed in Table 4.5.

Table 4.5 Reference Model Longitudinal Modes

Natural Frequency and Damping Ratio

[column headings illegible]
1.46   0.96   0.75   0.96

Compared to the open-loop dynamics, the closed-loop reference model has modes
that are heavily damped. Moreover, the natural frequency of the phugoid is much higher in
the closed-loop system. This reference system meets military specification requirements.

4.2.4  Application  Issues 

In this section, the application of the hybrid flight control law to the high
performance aircraft model is discussed. Figure 4.16 illustrates the block diagram
representing the closed-loop simulation of the hybrid controlled aircraft model in the
NetSim simulation and design package.

[Block diagram: Hybrid Controlled Aircraft Simulation]

Figure 4.16 Block Diagram of the Hybrid Controlled Aircraft Simulation

The  main  modules  in  Figure  4.16  represent  the  reference  model,  hybrid  controller, 
high  performance  aircraft  model,  and  the  linear-Gaussian  network.  The  function  of  the 
remaining  modules  is  to  modify  the  output  signals  (represented  by  the  connecting  arrows) 
passed to the main modules. Again, the number in the lower left corner of each block
dictates the order of execution at each time step. Modules that are called more than once per
time step are shown with multiple sequence numbers. The following paragraph outlines
the principal function of each module.

Random is the first module that is executed. It generates randomly selected
reference commands for the altitude and velocity of the aircraft within a user-defined
operating range. The length of time these commands are held constant before a new set of
reference commands is generated is also determined by the user. The commands are
supplied to the 1st Order Sys module. This module processes the reference commands
with a user-defined, rate-limited first-order filter. The purpose of this module is to smooth
the step commands generated by the Random module, effectively outputting a smoothed
ramp to a step command. The function of Reference is to generate the desired state
trajectory that is to be followed by the hybrid control law. The reference model that is used
is discussed in Section 4.2.3. The A/C Switch module supplies the state at the current
time, as well as the state and control at the previous time step, to the network. The switch
also sends a flag to the network to ensure that learning only occurs with states and controls
that are at consistent times. The linear-Gaussian network in the hybrid control architecture
is contained in A/C Net. The role of the Multiplexor (shown with sequence numbers 6
and 9) is to store the output of the learned mapping for various inputs of state and control
required by the hybrid control law. Hybrid executes the hybrid control law developed in
Section 3.3. The complete high performance aircraft model is contained in the Aircraft
module.

4.2.5  High  Performance  Aircraft  Experiment  1 

In experiment 1, the aircraft was given random commands for altitude and velocity.
More specifically, the random altitude commands were within ±500 feet and the random
velocity commands were within ±10 feet per second. As discussed in Section 4.2.4, the
commands are filtered by a rate-limited, first-order system. The rate limits for altitude and
velocity were set to 50 feet per second and 4 feet per second per second, respectively. The
filtering and rate limiting are intended to result in a physically feasible reference trajectory.
The initial condition for the aircraft was an equilibrium condition at an altitude of 9800 feet
and velocity of 539 feet per second (trim condition 5 in Table 4.4). For each new
randomly generated command, the aircraft was reinitialized to this same trim condition. By
randomly selecting commands, the objective was to generate state trajectories that fully
traverse a small region of the aircraft operating envelope.

Similar to the experiments involving the aeroelastic oscillator, the linearized
dynamics of the aircraft supplied to the hybrid controller were perturbed from their actual
values. The purpose of the perturbations was to increase model uncertainty, a feature the
hybrid controller is able to accommodate. The perturbations to the dynamics can be viewed
as a situation wherein the flight control system is provided linearized dynamics that
represent a trim condition other than that for which the maneuvers are actually to take place.
The intent is to illustrate that the hybrid controller is able to adequately control the aircraft
given an inaccurate linear representation, indicating that less accurate a priori design
information is needed and that use of the hybrid controller can thereby reduce design
costs.

The  learning  component  used  in  the  hybrid  control  law  was  again  the  spatially 
localized  system  developed  in  Section  3.2.  For  this  case,  the  network  consisted  of  8  linear- 
Gaussian  nodes.  This  relatively  small  number  of  nodes  was  considered  to  be  sufficient  due 
to  the  modest  nonlinearities  expected  for  the  specified  class  of  reference  trajectories.  Of  the 
two  largest  factors  in  determining  the  nonlinearity  of  the  system,  angle-of-attack  and  Mach 
number, only angle-of-attack experiences significant changes during the maneuvers
associated with these reference trajectories. This is due to the relatively small commanded
changes in altitude and velocity when compared to the flight envelope, and thus small
changes in Mach number. The centers of the linear-Gaussian nodes were arranged in a
user-defined grid over the input space, with the highest density of nodes in the angle-of-
attack dimension (due to expected nonlinearities). Moreover, the spatial decay of each node
was  varied  as  a  function  of  the  center  location  of  its  nearest  neighbor.  The  closer  the 
neighboring  center,  the  higher  the  spatial  decay,  and  conversely,  the  farther  the  neighboring 
center,  the  lower  the  spatial  decay.  This  pattern  ensures  that  each  point  in  the  input  space 
can be adequately mapped to the desired output values. Initial values for the slopes and
biases of the linear-Gaussian nodes were set to zero, since no a priori design information
was assumed. Due to this initialization to zero, the learning system does not impact the
states at start-up and all of the unknown dynamics are initially faced by the TDC adaptive
component. After evaluating the relative magnitude of each element of the unknown
dynamics and disturbance vectors supplied by the adaptive component, the cost function
(Equation 3.20) was weighted to ensure all errors between the desired output and actual
network output have the same significance. Equation (4.19) demonstrates how the cost
function can be weighted for specific errors between the desired and actual network output:

$$J = \frac{1}{2}\,[d(x) - f_{net}(x,p)]^{T}\, C\, [d(x) - f_{net}(x,p)] \tag{4.19}$$

where C is a diagonal matrix with user-supplied weights along the diagonal. The global
learning rate (α) was selected by trial and error in order to find the highest rate of
convergence to the desired output with adequate accuracy while still maintaining a static
mapping (i.e., one for which parameters are not in a continuous state of change).
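The effect of the diagonal weighting in Equation (4.19) can be sketched as follows. The leading factor of one half, the function name, and the choice of weights (inverse square of a typical error magnitude) are assumptions for illustration; the report does not list the weights that were used.

```python
def weighted_cost(desired, actual, weights):
    """J = 0.5 * e^T C e with C = diag(weights) and e = desired - actual."""
    return 0.5 * sum(w * (d - a) ** 2 for d, a, w in zip(desired, actual, weights))


# Hypothetical outputs whose typical magnitudes differ by a factor of 100.
desired = [1.0, 100.0]
actual = [0.0, 0.0]
unweighted = weighted_cost(desired, actual, [1.0, 1.0])
# Weights of 1 / (typical magnitude)^2 give both errors comparable significance.
balanced = weighted_cost(desired, actual, [1.0, 1.0e-4])
```

Without weighting, the second output dominates the cost almost entirely; with the scaled weights, the two error terms contribute equally.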

The  model  of  the  aircraft  dynamics,  which  is  in  a  continuous  time  form,  was 
integrated  at  50  Hertz  to  provide  a  balanced  tradeoff  between  numerical  stability  and 
processing speed. However, the control signal was calculated at a more moderate rate of
10 Hertz in order to reduce the real-time sensing and computation requirements in
determining the complete state.
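Running the integration at 50 Hertz while updating the control at 10 Hertz amounts to holding each control value for five integration steps. The loop below is a minimal sketch of that arrangement; the function name, the zero-order hold between updates, and the scalar test plant are assumptions for illustration and not the NetSim implementation.

```python
import math


def simulate(f, control_law, x0, t_end, sim_hz=50, ctrl_hz=10):
    """Integrate x_dot = f(x, u) at sim_hz with the control recomputed at ctrl_hz.

    Between control updates the last control is held (zero-order hold).
    Returns the final state and the number of control updates performed.
    """
    steps_per_update = sim_hz // ctrl_hz   # 5 integration steps per control update
    dt = 1.0 / sim_hz
    x, u, updates = list(x0), None, 0
    for k in range(int(round(t_end * sim_hz))):
        if k % steps_per_update == 0:
            u = control_law(x)
            updates += 1
        # Second-order Runge-Kutta midpoint step with the held control.
        k1 = f(x, u)
        x_mid = [xi + 0.5 * dt * ki for xi, ki in zip(x, k1)]
        k2 = f(x_mid, u)
        x = [xi + dt * ki for xi, ki in zip(x, k2)]
    return x, updates


# Hypothetical scalar plant x_dot = -x + u with a zero control law, run for 1 second.
x_final, n_updates = simulate(lambda x, u: [-x[0] + u], lambda x: 0.0, [1.0], 1.0)
exact = math.exp(-1.0)
```

Over one second this performs 50 integration steps and 10 control evaluations, so the state only needs to be sensed and the control law evaluated one fifth as often as the dynamics are propagated.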

After running the simulation with randomly generated commands for 500 trials of
20 seconds each, the learning system was able to build a mapping of a significant amount
of the previously unknown dynamics. Since the true mapping of the unknown dynamics is
not known (in contrast to the case with the aeroelastic oscillator), the cost, as defined in
Equation (4.19), is used as a measure of performance of the learning system. Figure 4.17
illustrates both the initial cost and the cost after 500 trials of learning for a 500 foot climb
and simultaneous 10 feet per second increase in velocity.

[Plot: Cost vs Time]

Figure 4.17 Comparison between initial cost and cost after 500 trials.

Since the cost is significantly lower after learning, the learning component of the
hybrid law has evidently built a mapping of a significant amount of the
unknown dynamics. If the simulation is allowed to run even longer, the cost will further
decrease. However, since the true aircraft model dynamics are very high dimensional and
contain states that are not included as inputs to the network (e.g., the state of the
actuators), it is impossible to completely learn the initially unknown dynamics. For this
reason, there will always be a finite, nonzero cost.

The state trajectories for the reference model, for the TDC controlled aircraft, and
for the hybrid controlled aircraft for a commanded 500 foot climb and 10 feet per second
increase in velocity are shown in Figures 4.18(a) through 4.23(a). Since the difference
between these trajectories is typically small compared to the absolute initial trim values, the
errors between the desired reference and the actual trajectory for both the TDC and hybrid
controlled aircraft are shown in Figures 4.18(b) through 4.23(b). The horizontal stabilator
deflection and throttle position for the TDC and hybrid controlled aircraft are shown in
Figures 4.24 and 4.25. These values represent the actual values used on the aircraft. Due
to actuator dynamics, these actual values are generally not the commanded output calculated
by the given control law.

As illustrated by the state trajectories, the hybrid control law offers improvements
over the TDC controller. Although the errors in velocity and altitude are relatively small,
errors in the vehicle rates and angles are significantly improved in the sense that
oscillations about the reference trajectory are reduced. This reduction in oscillations for
the hybrid controlled aircraft has the potential to change a response that was formerly
objectionable to the pilot to one that is satisfactory. Moreover, the horizontal stabilator
deflection for the hybrid controlled aircraft is improved over that of the TDC controlled
aircraft in the sense that the control signal is less oscillatory (and consequently less taxing
on the actuators).

The trajectories for the reference model and the hybrid controlled aircraft differ for
two major reasons. The first, as previously discussed, is the inability of the learning
system to map the unknown dynamics for states that are not given as inputs to the network
(e.g., actuator states). Perhaps more significant are the difficulties associated with
attempting to control more states than there are available control inputs. Since a pseudo-
inverse must be used in the hybrid control law when the number of controls is less than the
number of states, as discussed in Section 3.3, the tracking of the complete state is not
guaranteed even for a simulated case without any unknown dynamics (Anderson &
Schmidt (19[illegible])). Due to this inability to control all the state variables, it is almost
certain that differences will exist between the reference and actual trajectories. As a result,
errors between the reference trajectory and the hybrid controlled trajectory do not necessarily
represent a failure of the learning system to map the unknown dynamics, but an inability to
control all the states to a reference trajectory with a limited number of controls.
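The consequence of having fewer controls than states can be seen in a generic least-squares illustration. This is not the hybrid control law of Section 3.3, which is not reproduced here; it is a sketch, with hypothetical names and numbers, of how a pseudo-inverse solution for a single control leaves a residual whenever the demanded state derivative does not lie along the control direction.

```python
def least_squares_control(b, v):
    """Single control u minimizing ||b * u - v||.

    For a column vector b this equals pinv(b) applied to v:
    u = (b^T v) / (b^T b). Also returns the unmatched residual v - b * u.
    """
    bb = sum(bi * bi for bi in b)
    u = sum(bi * vi for bi, vi in zip(b, v)) / bb
    residual = [vi - bi * u for bi, vi in zip(b, v)]
    return u, residual


# Two states, one control: the control only acts on the first state.
u_a, res_a = least_squares_control([1.0, 0.0], [2.0, 3.0])
# Demand aligned with the control direction: it can be matched exactly.
u_b, res_b = least_squares_control([1.0, 2.0], [2.0, 4.0])
```

In the first case the second component of the demand cannot be met and remains as a residual of 3, while in the second case the residual is zero. Tracking errors of the first kind arise from the control structure itself rather than from any shortcoming of a learned mapping.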
[Plot: Velocity vs Time]

Figure 4.19(a) Velocity trajectories for the reference model, TDC controlled
aircraft, and hybrid controlled aircraft after 500 trials.

[Plot: Angle-of-Attack vs Time]

Figure 4.20(a) Angle-of-attack trajectories for the reference model, TDC controlled
aircraft, and hybrid controlled aircraft after 500 trials.

[Plot: Angle-of-Attack Error vs Time]

Figure 4.20(b) Error in angle-of-attack between reference trajectory and TDC or
hybrid controlled aircraft.

[Plot: Pitch Angle vs Time]

Figure 4.21(a) Pitch angle trajectories for the reference model, TDC controlled
aircraft, and hybrid controlled aircraft after 500 trials.

Figure 4.21(b) Error in pitch angle between reference trajectory and TDC or hybrid
controlled aircraft.

Figure 4.22(a) Altitude trajectories for the reference model, TDC controlled aircraft,
and hybrid controlled aircraft after 500 trials.

Figure 4.22(b) Error in altitude between reference trajectory and TDC or hybrid
controlled aircraft.

[Plot: Thrust vs Time; legend: Reference Thrust, TDC Thrust, Hybrid Thrust; horizontal axis: Time (sec)]

Thrust trajectories for the reference model, TDC controlled aircraft,
and hybrid controlled aircraft after 500 trials.

Nonetheless, experiment 1 demonstrates that the hybrid controller is able to
improve the performance of the aircraft over the purely adaptive TDC controller. This
improved performance is realized by exploiting the learned functional mapping of the
previously unknown model dynamics to remove the delay associated with the adaptive
component and reduce the model uncertainty to arrive at a superior nonlinear control law.
The next experiment illustrates the ability to generalize the synthesized mapping to a larger
input space generated by using a more demanding commanded altitude rate.
4.2.6  High Performance Aircraft Experiment 2

The objective of experiment 2 is to demonstrate the local generalization abilities of
the learned functional mapping to areas of the input space that have not explicitly been
trained. By increasing the rate limit on the randomly generated altitude command to 100
feet per second, the region of the input space for which controls must be computed is
effectively increased. Moreover, the reference trajectory is more demanding in the sense
that larger controls (resulting in larger angles and angular rates) are required to follow this
trajectory.

Beyond the increased altitude rate limit, the setup of experiment 2 is identical to
experiment 1 in terms of the learning system, initialization, and control calculation rate.
Figures 4.26 through 4.31 contain the state trajectories for the reference model, TDC
controlled aircraft, and hybrid controlled aircraft for a commanded 500 foot climb (at a 100
feet per second rate) and 10 feet per second increase in velocity using the network
previously trained in experiment 1. The horizontal stabilator and throttle position applied to
the aircraft for both the TDC and hybrid responses are shown in Figures 4.32 and 4.33.
Figure 4.26  Pitch rate trajectories for the reference model, TDC controlled aircraft, and
hybrid controlled aircraft using the network learned in experiment 1.
Figure 4.27  Velocity trajectories for the reference model, TDC controlled aircraft, and
hybrid controlled aircraft using the network learned in experiment 1.
Figure 4.28  Angle-of-attack trajectories for the reference model, TDC controlled aircraft,
and hybrid controlled aircraft using the network learned in experiment 1.
Figure 4.29  Pitch angle trajectories for the reference model, TDC controlled aircraft, and
hybrid controlled aircraft using the network learned in experiment 1.
Figure 4.30  Altitude trajectories for the reference model, TDC controlled aircraft, and
hybrid controlled aircraft using the network learned in experiment 1.
Figure 4.31  Thrust trajectories for the reference model, TDC controlled aircraft, and
hybrid controlled aircraft using the network learned in experiment 1.
As illustrated by the state trajectories, the TDC control law is unable to provide the
control necessary to reach the commanded states, whereas the hybrid controlled aircraft
generally follows the reference trajectory. Moreover, the banging of the horizontal
stabilator and throttle against their limits for the TDC controller illustrates a desperate
attempt to regain the desired state trajectory. This failure of TDC demonstrates the
consequences of not using experientially gained knowledge to remove the delay in the
estimate of the unknown dynamics and an inability to accommodate model uncertainty
(e.g., improve the a priori estimate of the control weighting matrix).

Experiment 2 also demonstrates the ability of the learning system to generalize to
nearby regions of the input space for which it has not explicitly received training samples.
This feature is especially important due to the fact that the hybrid control law uses a
passive learning system. Under passive learning, the learning system does not guide the
vehicle in an active search of the input space. Instead, the learning system is opportunistic
in the sense that it learns for a given region of the input space presented by the adaptive
controller for the state trajectories that have been flown. As a result, areas of the input
space that TDC is unable to traverse cannot initially receive training information.
However, due to generalization, the hybrid controller is able to stabilize and control the
aircraft in areas where the purely adaptive control law fails. Later excursions through these
regions will provide additional inputs for the learning system to process. This, of course,
suggests a conservative approach to flight testing / learning if the hybrid controller were to
be employed. Since the hybrid controller is able to adequately control the aircraft given an
inaccurate linear representation, less a priori design information is needed (i.e., fewer
design point linearizations), effectively reducing design costs and automating the tuning
process.
5  CONCLUSIONS AND RECOMMENDATIONS


5.1  SUMMARY AND CONCLUSIONS

This thesis describes the development and application of a hybrid control system to
the problem of flight control for a high performance aircraft. By combining an adaptive
component based on the TDC approach with a learning system, an innovative new hybrid
controller has been formed that allows each of these two mechanisms to focus on the
control objective for which it is best suited. The adaptive component of the hybrid
controller accommodates some of the unmodeled dynamics and provides estimates of any
unmodeled state dependent dynamic behavior to the learning system. The connectionist
learning system synthesizes a functional approximation of the state dependent dynamic
behavior. Using this learned mapping, the hybrid control system is able to predict state
dependent behavior, effectively removing the delay an adaptive controller experiences due
to its reactive nature.

The impact of a controller that has the ability to anticipate vehicle behavior has been
illustrated in terms of improved closed-loop aircraft performance. It has also been shown
that by using derivative information from the learned mapping, model uncertainty could be
reduced at each operating condition, essentially automating the tuning process normally
associated with gain scheduled controllers. Due to its ability to reduce model uncertainty,
the hybrid system adequately controls the aircraft even in situations where an inaccurate
linear representation was used as the system model during the initial design of the control
law. As a result, less a priori design information is needed (i.e., fewer design point
linearizations), effectively reducing design costs.
This thesis has also demonstrated the ability of a spatially localized learning system
to synthesize a nonlinear, multivariable mapping in a control environment. More
specifically, it has been shown that a linear-Gaussian network is able to learn a functional
approximation of the initially unknown dynamics, given state and control information,
using an incremental learning approach.

5.2  RECOMMENDATIONS FOR FUTURE WORK

The major constraint to the amount of improvement the hybrid control system could
realize was not a function of the unknown dynamics or the ability of the learning system to
synthesize this mapping, but the requirement that all the states follow their given reference
trajectory. Since aircraft have more states than controls, this requirement is unrealistic from
the control standpoint. Moreover, in many cases, only a few of the states are of direct
importance. Further research following (Anderson & Schmidt (1990)) should focus on
reducing the number of controlled states to be the same as the number of control inputs.
Using this approach, the pseudo-inverse required in the derivation of the hybrid control law
would be replaced by a true inverse, essentially allowing perfect model following for the
case where all of the initially unknown dynamics are learned and there is no state and
control observation noise.
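
The difference between a pseudo-inverse and a true inverse can be illustrated with a
minimal sketch. It is not the hybrid control law of this thesis: the input vector b, the
demanded vector v, and both function names are hypothetical, and a single control input is
assumed so that the pseudo-inverse reduces to a scalar formula.

```python
def least_squares_control(b, v):
    """Pseudo-inverse solution u = (b^T b)^-1 b^T v for a single control input.

    b: how the one control enters each controlled state (hypothetical values).
    v: the change demanded of each controlled state.
    """
    return sum(bi * vi for bi, vi in zip(b, v)) / sum(bi * bi for bi in b)


def following_error(b, u, v):
    """Per-state mismatch between the achieved change (b*u) and the demand (v)."""
    return [bi * u - vi for bi, vi in zip(b, v)]


# Two controlled states, one control: only a least-squares compromise exists.
b_two, v_two = [1.0, 2.0], [1.0, 0.0]
u_two = least_squares_control(b_two, v_two)
err_two = following_error(b_two, u_two, v_two)

# One controlled state, one control: the pseudo-inverse is a true inverse.
b_one, v_one = [1.0], [1.0]
u_one = least_squares_control(b_one, v_one)
err_one = following_error(b_one, u_one, v_one)
```

In this toy case the two-state demand leaves a nonzero following error in both states, while
the one-state demand is met exactly, which is the sense in which matching the number of
controlled states to the number of controls permits perfect model following.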

Another area for future work is the expansion of the hybrid control system to map
the entire flight envelope, as compared to a small subset of trajectories. This research
would require a much larger network than that used for the experiments in this thesis, due
to the expected nonlinearities in Mach number as well as angle-of-attack. A thorough
examination of the abilities of the hybrid control law trained over the entire flight envelope
could further highlight the advantages of this learning enhanced controller over
conventional techniques.

A future investigation into using different types of adaptive components (i.e., other
than TDC) in the hybrid control law is recommended (Astrom & Wittenmark (1989),
Slotine & Li (1991)). Moreover, future research should examine areas of automatic flight
control other than autopilots (e.g., stability augmentation systems and control
augmentation systems) where the hybrid control law offers potential improvements.
ABSTRACT


Learning systems represent an approach to optimal control law design for
situations where initial model uncertainty precludes the use of robust, fixed control
laws. This thesis analyzes a variety of techniques for the incremental synthesis
of optimal control laws, where the descriptor incremental implies that an on-line
implementation filters the information acquired through real-time interactions with
the plant and the operating environment. A direct/indirect framework is proposed
as a means of classifying approaches to learning optimal control laws. Within this
framework, relationships among existing direct algorithms are examined, and a
specific class of indirect control laws is developed.

Direct learning control implies that the feedback loop that motivates the learning
process is closed around system performance. Reinforcement learning is a type of
direct learning technique with origins in the prediction of animal learning phenomena
that is largely restricted to discrete input and output spaces. Three algorithms
that employ the concept of reinforcement learning are presented: the Associative
Control Process, Q learning, and the Adaptive Heuristic Critic.

Indirect learning control denotes a class of incremental control law synthesis
methods for which the learning loop is closed around the system model. The approach
discussed in this thesis integrates information from a learned mapping of the
initially unmodeled dynamics into finite horizon optimal control laws. Therefore,
the derivation of the control law structure as well as the closed-loop performance
remain largely external to the learning process. Selection of a method to approximate
the nonlinear function that represents the initially unmodeled dynamics is a
separate issue not explicitly addressed in this thesis.

Dynamic programming and differential dynamic programming are reviewed
to illustrate how learning methods relate to these classical approaches to optimal
control design.

The aeroelastic oscillator is a two state mass-spring-dashpot system excited
by a nonlinear lift force. Several learning control algorithms are applied to the
aeroelastic oscillator to either regulate the mass position about a commanded point
or to track a position reference trajectory; the advantages and disadvantages of
these algorithms are discussed.
Chapter 1


Introduction 


1.1  Problem  Statement 


The primary objective of this thesis is to incrementally synthesize a nonlinear
optimal control law, through real time, closed-loop interactions between the
dynamic system, its environment, and a learning system, when substantial initial
model uncertainty exists. The dynamic system is assumed to be nonlinear, time-invariant,
and of known state dimension, but otherwise only inaccurately described
by an a priori model. The problem, therefore, requires either explicit or implicit
system identification. No disturbances, noise, or other time varying dynamics exist.
The optimal control law is assumed to extremize an evaluation of the state
trajectory and the control sequence, for any initial condition.

1.2  Thesis  Overview 

One objective of this thesis is to present an investigation of several approaches
for incrementally synthesizing (on-line) an optimal control law. A second objective
is to propose a direct/indirect framework, with which to distinguish learning
algorithms. This framework subsumes concepts such as supervised/unsupervised
learning and reinforcement learning, which are not directly related to control law
synthesis. This thesis unifies a variety of concepts from control theory and behavioral
science (where the learning process has been considered extensively) by presenting
two different learning algorithms applied to the same control problem: the
Associative Control Process (ACP) algorithm [14], which was initially developed to
predict animal behavior, and Q learning [16], which derives from the mathematical
theory of value iteration.

The aeroelastic oscillator (§2), a two-state physical system that exhibits interesting
nonlinear dynamics, is used throughout the thesis to evaluate different control
algorithms which incorporate learning. The algorithms that are explored in §3, §4,
and §5 do not explicitly employ dynamic models of the system and, therefore, may
be categorized as direct methods of learning an optimal control law. In contrast, §6
develops an indirect, model-based approach to learning an optimal control law.

The Associative Control Process is a specific reinforcement learning algorithm
applied to optimal control, and a description of the ACP in §3 introduces the
concept of direct learning of an optimal control law. The ACP, which includes
a prominent network architecture, originated in the studies of animal behavior.
The Q learning algorithm, which derives from the mathematical theorems of policy
iteration and value iteration, is a simple reinforcement learning rule independent
of a network architecture and of biological origins. Interestingly, Klopf's ACP
[14] may be reduced so that the resulting system accomplishes Watkins' Q learning
algorithm [16]. Sutton's theory of the temporal difference methods [15], presented in
§5, subsumes the ACP and Q learning algorithms by generalizing the reinforcement
learning paradigm applied to optimal control.
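
The Q learning rule named in this overview can be sketched in its commonly published
tabular form. This is a minimal illustration rather than the formulation developed later in
the thesis: the function name, the dictionary used as a table, the learning rate alpha, and
the discount gamma are assumptions, and reinforcement is treated here as a reward to be
maximized.

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One tabular Q learning step; unseen (state, action) pairs default to 0."""
    # Value of the best action available from the subsequent state.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    # Move the stored estimate toward reward + discounted best future value.
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


actions = ["left", "right"]
q_table = {}
first = q_update(q_table, 0, "left", 1.0, 1, actions)
q_table[(1, "right")] = 2.0  # pretend the next state is already known to be valuable
second = q_update(q_table, 0, "left", 1.0, 1, actions)
```

The update uses only observed transitions and reinforcement, with no model of the plant,
which is why such rules fall under the direct methods in the framework proposed here.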

Several control laws that are optimal with respect to various finite horizon cost
functionals are derived in §6 to introduce the indirect approach to learning optimal
controls. The structure of the control laws with and without learning augmentation
appears for several cost functionals, to illustrate the manner in which learning may
augment a fixed parameter control design.

Finally, dynamic programming (DP) and differential dynamic programming
(DDP) are reviewed in Appendix A as classical, alternative methods for synthesizing
optimal controls. DDP is not restricted to operations in a discrete input space
and discrete output space. The DP and DDP algorithms are model-based and,
therefore, learning may be introduced by explicitly improving the a priori model,
resulting in an indirect learning optimal controller. However, neither DP nor DDP
is easily implemented on-line. Additionally, DDP does not address the problem of
synthesizing a control law over the full state space.
1.3  Concepts


The primary job of an automatic controller is to manipulate the inputs of a
dynamic system so that the system behavior satisfies the stability and performance
specifications which constitute the control objective. The design of such a control
law may involve numerous difficulties, including multivariable, nonlinear, and time
varying dynamics, with many degrees of freedom. Further design challenges arise
from the existence of model uncertainty, disturbances and noise, complex objective
functions, operational constraints, and the possibility of component failure. An
examination of the literature reveals that control design methodologies typically
address a subset of these issues while making simplifying assumptions to satisfy the
remainder, a norm to which this thesis conforms.

This section is intended to introduce the reader to some of the relevant issues
by previewing concepts that appear throughout the thesis and are peculiar to learning
systems and control law development. Additionally, this section motivates the
importance of learning control research.

1.3.1  Optimal  Control 

This thesis examines methods for synthesizing optimal control laws, the objective
of which is to extremize a scalar functional evaluation of the state trajectory
and control history. The solution of an optimal control problem generally requires
the solution of a constrained optimization problem; the calculus of variations and
dynamic programming address this issue. However, an optimal control rule may be
evaluated by these methods only if an accurate model of the dynamics is available.
In the absence of a complete and accurate a priori model, these approaches may
be applied to a model that is derived through observed objective function evaluations
and state transitions; this constitutes indirect learning control. Alternatively,
in environments with substantial initial uncertainty, direct learning control can be
considered to perform incremental dynamic programming without explicitly estimating
a system model [1].

1.3.2  Fixed  and  Adjustable  Control 

Most control laws may be classified into one of two broad categories: fixed or
adjustable. The constant parameters of fixed control designs are selected using an
a priori model of the plant dynamics. As a result, stability robustness to modeling
uncertainty is potentially traded against performance; the attainable performance
of the closed-loop system is limited by the accuracy of the a priori description
of the equations of motion and statistical descriptions of noise and disturbances.
Adjustable control laws incorporate real-time data to reduce, either explicitly or
implicitly, model uncertainty, with the intention of improving the closed-loop
response.

An adjustable control design becomes necessary in environments where the
controller must operate in uncertain conditions or when a fixed parameter control
law that performs sufficiently well cannot be designed from the limited a priori
information. The two main classes of adjustable control are adaptation and learning;
both reduce the level of uncertainty by filtering empirical data that is gained
experientially [2,
1.3.3  Adaptive Control


Noise and disturbances, which are present in all real systems, represent the
unpredictable, time dependent features of the dynamics. Nonlinearities and coupled
dynamics, which are predictable, spatial functions, constitute the remaining
model errors.* Adaptive control techniques react to dynamics that appear to be
time varying, while learning controllers progressively acquire spatially dependent
knowledge about unmodeled dynamics. This fundamental difference in focus allows
learning systems to avoid several deficiencies exhibited by adaptive algorithms in
accommodating model errors. Whenever the plant operating condition changes, a
new region of the nonlinear dynamics may be encountered. A memoryless adaptive
control method must reactively adjust the control law parameters after observing
the system behavior for the current condition, even if that operating condition has
been previously experienced. The transient effects of frequently adapting control
parameters may degrade closed-loop performance. A learning system, which utilizes
memory to recall the appropriate control parameters as a function of the operating
condition or state of the system, may be characterized as predictive rather than
reactive.

Adaptive control exists in two flavors: direct and indirect. Indirect adaptive
control methods calculate control actions from an explicit model of the system,
which is enhanced with respect to the a priori description through a system
identification procedure. Direct adaptive control methods modify the control system
parameters without explicitly developing improvements in the initial system model.
While direct adaptive techniques to perform regulation and tracking are well
established, adaptive optimal controllers are primarily indirect.

* The term spatial implies a function that does not explicitly depend on time.

1.3.4  Learning  Control 

A learning control system is characterized by the automatic synthesis of a
functional mapping through the filtering of information acquired during previous
real-time interactions with the plant and operating environment [2]. With the
availability of additional experience, the mapping of appropriate control actions as
a function of state, or the mapping of unmodeled dynamics as a function of state and
control, is incrementally improved. A learning system, which is implemented using
a general function approximation scheme, may either augment traditional fixed or
adaptive control designs, or may operate independently.

1.3.5  Generalization  and  Locality  in  Learning 

Generalization in a parameterized, continuous mapping implies that each adjustable
parameter influences the mapping over a region of non-zero measure [4].
The effect of generalization in function synthesis is to provide automatic interpolation
between training data. If the plant dynamics are continuous functions of time
and state, then the control law will also be continuous. Therefore, the validity of
generalization follows directly from the continuity of the dynamics and the desired
control law [2].

The concept of locality of learning is related to generalization, but differs in
scope. Locality of learning implies that a change in any single adjustable parameter
will only alter the mapped function over a localized region of the input space. For
non-localized learning, extensive training in a restricted region of the input space,
which might occur when a system is regulated about a trim condition, can corrupt
the previously acquired mapping for other regions. Therefore, on-line learning for
which training samples may be concentrated in a specific region of the input space,
requires the locality attribute [2,3,4].
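
Generalization and locality can both be seen in a small sketch of a mapping built from
Gaussian basis functions. This is an illustration under assumed values, not the network
used in the experiments: the class name, the basis centers, the width, and the learning rate
are hypothetical, and a Gaussian is only approximately local because its tails never reach
exactly zero.

```python
import math


def gaussian(x, center, width):
    """Basis function that is near 1 close to its center and near 0 far away."""
    return math.exp(-((x - center) / width) ** 2)


class LocalMap:
    """Weighted sum of Gaussian basis functions over a scalar input."""

    def __init__(self, centers, width):
        self.centers = list(centers)
        self.width = width
        self.weights = [0.0] * len(self.centers)

    def predict(self, x):
        return sum(w * gaussian(x, c, self.width)
                   for w, c in zip(self.weights, self.centers))

    def train(self, x, target, rate=0.5):
        # Each weight moves in proportion to how active its basis function is at x.
        error = target - self.predict(x)
        for i, c in enumerate(self.centers):
            self.weights[i] += rate * error * gaussian(x, c, self.width)


mapping = LocalMap(centers=[float(c) for c in range(11)], width=0.5)
for _ in range(50):
    mapping.train(2.0, 1.0)  # all training samples concentrated at x = 2

near = mapping.predict(2.0)      # learned value at the trained point
between = mapping.predict(2.2)   # nonzero: interpolation around the sample
far = mapping.predict(8.0)       # essentially unchanged from its initial zero
```

Repeated training at one operating point shapes the mapping in a neighborhood of that
point and leaves distant regions of the input space effectively untouched.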


1.3.6  Supervised and Unsupervised Learning


Learning procedures may be distinguished as supervised or unsupervised according
to the type of instructional information provided by the environment. A
supervised learning controller requires both a teacher that provides the desired
system response and the cost functional which depends on the system output error
[5]. Supervised learning control systems often form the error signal by comparing
measured system characteristics with predictions generated by an internal model.
The supervised learning process evaluates how each adjustable parameter, within
the internal model, influences the error signal.
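
A minimal sketch of this supervised process is given below, assuming a one-parameter
internal model whose prediction is theta times the input. The function name, the sample
data, the learning rate, and the number of passes are hypothetical choices made only for
illustration.

```python
def train_internal_model(samples, rate=0.1, passes=20):
    """Fit prediction = theta * x by following the gradient of the squared error."""
    theta = 0.0
    for _ in range(passes):
        for x, measured in samples:
            error = measured - theta * x  # measured response minus model prediction
            # d(0.5 * error**2)/d(theta) = -error * x, so step along +error * x.
            theta += rate * error * x
    return theta


# Measurements generated by a system whose true gain is 2.
theta = train_internal_model([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

The error signal and its sensitivity to the adjustable parameter are both available at
every sample, which is the instructional information that distinguishes supervised learning.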

The class of unsupervised control designs learns through a scalar evaluative
feedback signal, such as the measure generated by a cost function, that is less
informative than the gradient vector of the cost with respect to each adjustable
parameter. This type of learning is also referred to as learning with a critic. The
scalar evaluation which accrues from performing an action in a state does not indicate
the cost to perform any other action in that state. Therefore, even in the case
of only two possible actions, an evaluative learning signal contains significantly less
information than the feedback required by a supervised learning algorithm [h].
1.3.7  Direct and Indirect Learning

The classifiers direct and indirect learning are borrowed from the concept of
direct versus indirect adaptive control. Direct learning control implies the feedback
loop that motivates the learning process is closed around system performance.
Indirect learning control denotes that the learning loop is closed around the system
model. Whereas in §3 - §5 the learning process is closed around system performance,
in §6, the learning loop is closed around system model improvement, leaving the
control law derivation and resulting system performance “open-loop.”

Direct learning approaches to optimal control law synthesis, which employ
reinforcement learning techniques, are not readily applicable to the reference model
tracking problem. The adjustable parameters in a reinforcement learning method
encode an evaluation of the cost to complete the objective. In a tracking environment,
the command objective changes and future values may not be known.
Therefore, the cost to complete the objective changes and the application of methods
from §3 - §5 is restricted to regulation.

Indirect learning approaches to optimal control primarily employ supervised
learning algorithms. In contrast, direct learning methods for optimal control law
synthesis principally employ unsupervised learning algorithms.

1.3.8  Reinforcement Learning

Originally conceived in the study of animal learning phenomena, reinforcement
learning is a type of unsupervised learning that responds to a performance
measure feedback signal referred to as the reinforcement, which may represent a
reward or a cost. At each discrete time step, the controller observes the current state,
selects and applies an action, observes the subsequent state, and receives reinforcement.
The control objective is to maximize the expected sum of discounted future
reinforcement. The probability of choosing an action that yields a large discounted
future reinforcement should be increased; actions that lead to small discounted future
reinforcement should be selected less frequently [1]. Reinforcement learning
methods often acquire successful action sequences by constructing two complementary
functions: a policy function maps the states into appropriate control actions
and an evaluation function maps the states into expectations of the discounted
future reinforcement.
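
The quantity being maximized can be made concrete with a short sketch that sums one
observed sequence of reinforcement values. The function name and the discount factor
gamma are assumptions introduced for illustration; the expectation over possible future
sequences is not computed here.

```python
def discounted_return(reinforcements, gamma):
    """Sum r[0] + gamma*r[1] + gamma**2*r[2] + ... for one observed sequence."""
    total = 0.0
    # Work backward so each step adds its reinforcement to the discounted future sum.
    for r in reversed(reinforcements):
        total = r + gamma * total
    return total


value = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

An evaluation function of the kind described above would store an estimate of this sum,
averaged over experience, for each state.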

The study of connectionist learning methods has evolved from research in
k f , ( .*3 to theories founded in the established disciplines of function
approximation and optimization. Reinforcement learning has been demonstrated to
be a at technique for solving some stable, nonlinear, optimal control problems


Reinforcement learning addresses the credit assignment problem, which refers
reinforcement
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}  a  'S.  -iii  i.:  i.r.  •  (-u'Ujin  .nr  i  r  yiissivc  and  ,u:iive  learumg.  i*ass;.'r 

1' Ui  iiui  i  u;cK  1  ;-•  >(.))  ii  Ui  jii:  ?  -  :  ■  u ;  f  s :  Li  11  LUi'  i  i  ill '1 1  nation  that  hcci.iiiies  aviu  ■ 
'!e  '!  ii  ■  i).  s  H.in  of  vh  ’  !<'  J  'oj  s\  U'l  -  ui  contiiisy  a  control  sysifU' 

Jig  o  'll  I  o  I  a  ic  sciif'nit  '  .  iriov  ■ . -‘i.s  fo  ya .!!  inforiiiaf  io'i  in  iegi  .nis  wlieic 

o  uidi  u!  k  :i;iu>  orrurn  !  :  i  f  ■ 'C  on  1  jif  a;  j 'licatiwus,  tiiut  c,u'i)  a'iion  hiUi 
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an  information  collecting  role  impHes  a  tradeoff  between  the  expected  gain  of  infor¬ 
mation,  which  is  related  to  fui  jLrt  performance,  and  the  immediate  reinforcement, 
which  measures  the  current  system  performance  [8]. 

1.3.9  BOXES 

BOXES [8] is a simple implementation of a learning controller. The state
space is discretized into disjoint regions, and the learning algorithm maintains an
estimate of the appropriate control action for each region. Associated with any
approach using a discrete input space is an exponential growth in the number of
bins, as the state dimension or the number of quantization levels per state variable
increases [3]. Therefore, quantization of the state space is seldom an efficient
mapping technique, and a learning algorithm that uses this strategy can generally
represent only a coarse approximation of a continuous control law. Although this
lookup table technique facilitates some aspects of implementation, any parameterized
function approximation scheme capable of representing continuous functions
will be more efficient with respect to the necessary number of free parameters.
Additionally, generalization is inherent to such continuous mappings [...]. A BOXES
approach exhibits locality in learning, but does not generalize information across [...]
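The growth in table size described above can be made concrete with a short sketch. The function names are hypothetical and the quantization levels are illustrative values, not figures taken from the BOXES literature.

```python
from math import prod


def total_boxes(levels_per_variable):
    """Number of lookup-table entries for a quantized state space.

    levels_per_variable[d] is the number of quantization levels chosen
    for state variable d (an illustrative input, not a value from the text).
    """
    return prod(levels_per_variable)


def growth_table(levels, max_dimension):
    """Table sizes for 1..max_dimension state variables at a fixed resolution."""
    return [total_boxes([levels] * d) for d in range(1, max_dimension + 1)]


if __name__ == "__main__":
    # Ten levels per variable: each added state variable multiplies the table by ten.
    for dimension, size in enumerate(growth_table(10, 6), start=1):
        print(dimension, size)
```

Each box stores an independent estimate, which is why the table exhibits locality but cannot share what it learns between neighbouring regions.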
Chapter 2

The Aeroelastic Oscillator


2.1 General Description


A simple aeroelastic oscillator (AEO) may be modeled as a classical mass-
spring-dashpot system with the addition of two external forces: an aerodynamic
lift force and a control force (Figure 2.1). The mass, a rectangular block exposed
to a steady wind, is constrained to translate in the direction normal to the vector
of the incident wind and in the plane of the page in Figure 2.1. Specifications
of the AEO plant are borrowed from Parkinson and Smith [9] as well as from
Thompson and Stewart [10]. The low dimensionality of the dynamic state, which
consists of the position x(t) and the velocity of the mass, reduces the complexity
of computer simulations and allows the system dynamics to be easily viewed in
a two-dimensional phase plane. The AEO exhibits a combination of interesting
nonlinear dynamics, generated by the nonlinear aerodynamic lift, and parameter
uncertainty that constitute a good context in which to study learning as a method
of incrementally synthesizing an optimal control law. The control objective may be
either regulating the state near the origin of the phase plane or tracking a reference
trajectory.


Figure 2.1. The aeroelastic oscillator.


2.2  The  Equations  of  Motion 

To investigate the AEO dynamics, the block is modeled as a point mass at
which all forces act. The homogeneous equation of motion for the aeroelastic
oscillator is a second-order, linear, differential equation with constant coefficients.
This equation accurately represents the physical system for zero incident windspeed, in
the range of block position and velocity for which the spring and dashpot respond
linearly.


Table 2.1. Physical variable definitions.

Physical Property        Symbol
Block Position           x(t)
Block Mass               m
Damping Coefficient      r
Spring Coefficient       k

For the undriven system, the block position may be described as a function of
time by a weighted sum of exponentials whose powers are the roots of the
characteristic equation.


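\[ x(t) = c_1 e^{s_1 t} + c_2 e^{s_2 t} \tag{2.2} \]

\[ s_{1,2} = \frac{-r \pm \sqrt{r^2 - 4mk}}{2m} = -\frac{r}{2m} \pm \frac{\sqrt{r^2 - 4mk}}{2m} \tag{2.3} \]

\[ k \ge 0 \ \text{and} \ r, m > 0 \ \Rightarrow \ \operatorname{Re}[s_{1,2}] \le 0 \tag{2.4} \]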


The condition that the dashpot coefficient is positive and the spring coefficient is
non-negative implies that the position and velocity transients will decay
exponentially. This unforced system possesses a stable equilibrium at the origin of
the phase plane.

The aerodynamic lift L(t) and control force F_0(t) constitute the driving
component of the equation of motion. Including these forces, the equation of motion
becomes a non-homogeneous, second-order, differential equation with constant
coefficients.

\[ m \frac{d^2 x}{dt^2} + r \frac{dx}{dt} + k x = L + F_0 \tag{2.5} \]


Figure 2.2. The aeroelastic oscillator nonlinear coefficient of lift.
(Horizontal axis: angle of attack α in degrees, 0 to 16.)

The lift force is a nonlinear function of the effective angle of attack of the
mass block with respect to the incident air flow. No current aerodynamic theory
provides an analytic method for predicting the flow around an excited rectangular
block. Therefore, the coefficient of lift is approximated, using empirical data, as a
seventh-order polynomial in the tangent of the effective angle of attack α (Figure
2.2) [9,10]. This approximation to the empirical data is valid for a range of angles
of attack near zero degrees, |α| < 18°.


\[ L = \tfrac{1}{2} \rho V^2 h l \, C_L \tag{2.6} \]


Figure 2.3. The total velocity vector V_e and the effective angle of
attack α.


Table 2.2. Additional physical variable definitions.

Physical Property                        Symbol
Density of Air                           ρ
Velocity of Incident Wind                V
Area of Cross-section of Mass Block      hl
Coefficient of Lift                      C_L

Following from the absence of even powers of tan α in the polynomial (2.7), the
coefficient of lift is an odd symmetric function of the angle of attack, which, given
the geometry of the AEO, seems physically intuitive. The definition of the effective
angle of attack is most apparent from the perspective that the AEO is moving
through a stationary fluid. The total velocity V_e equals the sum of two orthogonal
components: the velocity associated with the oscillator as a unit translating through
the medium (i.e. the incident flow V), and the velocity ẋ associated with the mass
block vibrating with respect to the fixed terminals of the spring and dashpot (Figure
2.3). This total velocity vector will form an effective angle of attack α with respect
to the incident flow vector.

The dimensional equation of motion (2.5) can be nondimensionalized by
dividing through by kh and applying the rules listed in Table 2.3. The resulting
equation of motion may be written as (2.9) or equivalently (2.10).
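As a cross-check on this scaling, the division by kh can be sketched step by step. The sketch assumes the changes of variables $X = x/h$, $\tau = \omega t$, $\omega^2 = k/m$, $\beta = r/(2m\omega)$, $n = \rho h^2 l/(2m)$, $U = V/(\omega h)$ and $F' = F_0/(kh)$, with primes denoting derivatives with respect to $\tau$. These definitions are assumptions chosen to be consistent with the linearized dynamics used in §2.4, and the result is not claimed to be the exact form of (2.9) or (2.10).

```latex
\begin{align*}
m\ddot{x} + r\dot{x} + kx &= \tfrac{1}{2}\rho V^2 h l\, C_L + F_0 \\
\frac{m\omega^2}{k}\,X'' + \frac{r\omega}{k}\,X' + X
   &= \frac{\rho V^2 l}{2k}\, C_L + \frac{F_0}{kh} \\
X'' + 2\beta X' + X &= n U^2 C_L + F'
\end{align*}
```

Under the same assumptions $\tan\alpha = \dot{x}/V = X'/U$, so a lift polynomial whose first-order term is $A_1 \tan\alpha$ contributes $n A_1 U X'$ to the right-hand side, which gives a net linear damping coefficient of $n A_1 U - 2\beta$.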


The coefficient of lift is parameterized by the following four empirically
determined constants: A_1 = 2.69, A_3 = 168, A_5 = 6270, A_7 = 59900 [9,10]. The
other nondimensional system parameters were selected to provide interesting
nonlinear dynamics: n = 4.3 · 10^-4, β = 1.0, and U/U_c = 1.6. These parameters define
U_c = 1729.06 and U = 2766.5, where the nondimensional critical windspeed U_c is
defined in §2.3. The nondimensional time is expressed in radians.
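A small numerical sketch can make the role of these constants concrete. It assumes that the seventh-order polynomial has the alternating-sign form C_L = A_1 tan α - A_3 tan³α + A_5 tan⁵α - A_7 tan⁷α. The sign pattern and the function names are assumptions, chosen because they place the zero of lift close to the ±15.3° angle discussed in §2.3.

```python
import math

# Empirical constants quoted in the text.
A1, A3, A5, A7 = 2.69, 168.0, 6270.0, 59900.0


def lift_coefficient(alpha_deg):
    """Assumed odd polynomial in tan(alpha); alpha in degrees."""
    t = math.tan(math.radians(alpha_deg))
    return A1 * t - A3 * t**3 + A5 * t**5 - A7 * t**7


def zero_lift_angle(lo=15.0, hi=16.0, tol=1e-9):
    """Bisection for the positive angle at which the assumed C_L changes sign."""
    f_lo = lift_coefficient(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = lift_coefficient(mid)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print("zero-lift angle (deg):", zero_lift_angle())
    print("odd symmetry check:", lift_coefficient(10.0) + lift_coefficient(-10.0))
```

Beyond the zero-lift angle the assumed polynomial is negative, which corresponds to a lift force that opposes further growth of the oscillation.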


Table 2.3. Required changes of variables.

New Variables                    Relationships
Reduced displacement             X = x/h
Mass parameter                   n = ρh²l/(2m)
Natural frequency                ω = sqrt(k/m)
Reduced incident windspeed       U = V/(ωh)
Damping parameter                β = r/(2mω)
Reduced time (radians)           τ = ωt
Nondimensional Control Force     F' = F_0/(kh)

The transformation from nondimensional parameters (n, β, and U) to
dimensional parameters (ρ, h, l, m, V, r, and k) is not unique. Moreover, the
nondimensional parameters that appear above will not transform to any physically
realistic set of dimensional parameters. However, this set of nondimensional
parameters creates fast dynamics, which facilitates the analysis of learning techniques.

An additional change of variables scales the maximum amplitudes of the
block's displacement and velocity to approximately unity in order of magnitude.
The dynamics that are used throughout this thesis for experiments with the
aeroelastic oscillator appear in (2.12).


(2.11) 


Equation (2.12) may be further rewritten as a pair of first-order differential
equations in a state space realization. Although in the dimensional form ẋ = dx/dt,
in the nondimensional form, Ẋ = dX/dτ.


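\[ x_1 = X, \qquad x_2 = \dot{X} \tag{2.13} \]

\[ \begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \end{bmatrix} =
   \begin{bmatrix} 0 & 1 \\ -1 & nA_1U - 2\beta \end{bmatrix}
   \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} +
   \begin{bmatrix} 0 \\ 1 \end{bmatrix} u +
   \begin{bmatrix} 0 \\ f(x_2) \end{bmatrix} \tag{2.14a} \]

\[ f(x_2) = \frac{n}{1000} \left[ -\frac{A_3}{U} (1000 x_2)^3 + \frac{A_5}{U^3} (1000 x_2)^5
   - \frac{A_7}{U^5} (1000 x_2)^7 \right] \tag{2.14b} \]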


2.3  The  Open-loop  Dynamics 


The reduced critical windspeed U_c, which depends on the nondimensional
mass parameter, the damping parameter, and the first-order coefficient in the
coefficient of lift polynomial, is the value of the incident windspeed at which the
negative linear aerodynamic damping exceeds the positive structural damping.


\[ U_c = \frac{2\beta}{n A_1} \tag{2.15} \]
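As a quick arithmetic check, (2.15) can be evaluated with the parameter values quoted in §2.2. The variable and function names below are illustrative only.

```python
def critical_windspeed(n, beta, a1):
    """Reduced critical windspeed U_c = 2*beta / (n * A1), following (2.15)."""
    return 2.0 * beta / (n * a1)


N_MASS, BETA, A1 = 4.3e-4, 1.0, 2.69   # values quoted in Section 2.2
U_C = critical_windspeed(N_MASS, BETA, A1)
U = 1.6 * U_C                          # the stated ratio U / U_c = 1.6

# Net linear damping coefficient n*A1*U - 2*beta; positive means the origin is unstable.
LINEAR_DAMPING = N_MASS * A1 * U - 2.0 * BETA

if __name__ == "__main__":
    print(U_C, U, LINEAR_DAMPING)
```

With these values the net linear damping term is positive, consistent with an operating point above the critical windspeed.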


Figure 2.4. The aeroelastic oscillator open-loop dynamics. An outer
stable limit cycle surrounds an unstable limit cycle that
in this picture decays inward to an inner stable limit
cycle.

The nature of the open-loop dynamics is strongly dependent on the ratio of the
reduced incident windspeed to the reduced critical windspeed. At values of the
incident windspeed below the critical value, the focus of the phase plane is stable
and the state of the oscillator will return to the origin from any perturbed initial
condition. For windspeeds greater than the critical value, the focus of the
two-dimensional state space is locally unstable; the system will oscillate, following a
stable limit cycle clockwise around the phase plane. The aeroelastic oscillator is
globally stable, in a bounded sense, for all U.² The existence of global stability
is suggested by the coefficient of lift curve (Figure 2.2); the coefficient of lift curve
predicts zero lift (C_L = 0) for α = ±15.3° and a restoring lift force for larger |α|.
That the aeroelastic oscillator is globally open-loop stable eliminates the necessity
for a feedback loop to provide nominal stability during learning experiments. For
incident windspeeds greater than U_c, a limit cycle is generated at a stable Hopf
bifurcation. In this simplest form of dynamic bifurcation, a stable focus bifurcates
into an unstable focus surrounded by a stable limit cycle under the variation of a
single independent parameter, U. For a range of incident wind velocity, two stable
limit cycles, separated by an unstable limit cycle, characterize the dynamics (Figure
2.4). Figure 2.4 was produced by a 200 Hz simulation in continuous time of the AEO
equations of motion, using a fourth-order Runge-Kutta integration algorithm. An
analysis of the open-loop dynamics appears in Appendix B.

¹ The term reduced is synonymous with nondimensional.


2.4  Benchmark  Controllers 


A simulation of the AEO equations of motion in continuous time was
implemented in the NetSim environment. NetSim is a general purpose simulation and
design software package developed at the C. S. Draper Laboratory [11]. Ten NetSim
cycles were completed for each nondimensional time unit while the equations of
motion were integrated over twenty steps using a fourth-order Runge-Kutta algorithm
for each NetSim cycle.

² Each state trajectory is a member of L_∞ (i.e. ||x(t)||_∞ is finite) for all
perturbations δ with bounded Euclidean norms, ||δ||_2.
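The integration scheme described above (ten cycles per nondimensional time unit, twenty Runge-Kutta steps per cycle, hence a step of 1/200) can be sketched in a few lines. This is not the NetSim implementation. It assumes the scaled dynamics take the form ẋ_1 = x_2, ẋ_2 = -x_1 + (nA_1U - 2β)x_2 + f(x_2) + u, with f built from the higher-order lift terms and an amplitude scale factor of 1000. That form, and every function name, is an assumption made for illustration.

```python
import math

# Nondimensional parameters quoted in Section 2.2.
N_MASS, BETA, U = 4.3e-4, 1.0, 2766.5
A1, A3, A5, A7 = 2.69, 168.0, 6270.0, 59900.0
SCALE = 1000.0               # assumed amplitude scaling factor
STEPS_PER_TIME_UNIT = 200    # 10 cycles x 20 Runge-Kutta steps


def lift_nonlinearity(x2):
    """Assumed higher-order lift terms as a function of the scaled velocity."""
    v = SCALE * x2
    return N_MASS * (-(A3 / U) * v**3 + (A5 / U**3) * v**5
                     - (A7 / U**5) * v**7) / SCALE


def aeo_derivative(state, u=0.0):
    """Time derivative of (x1, x2) under a constant control force u."""
    x1, x2 = state
    linear_damping = N_MASS * A1 * U - 2.0 * BETA
    return (x2, -x1 + linear_damping * x2 + lift_nonlinearity(x2) + u)


def rk4_step(state, u, dt):
    """One classical fourth-order Runge-Kutta step with u held constant."""
    def advance(k, h):
        return (state[0] + h * k[0], state[1] + h * k[1])

    k1 = aeo_derivative(state, u)
    k2 = aeo_derivative(advance(k1, dt / 2.0), u)
    k3 = aeo_derivative(advance(k2, dt / 2.0), u)
    k4 = aeo_derivative(advance(k3, dt), u)
    return tuple(state[i] + dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i])
                 for i in range(2))


def simulate(initial_state, duration, controller=None):
    """Integrate for `duration` time units; controller maps state -> force."""
    dt = 1.0 / STEPS_PER_TIME_UNIT
    state = tuple(initial_state)
    trajectory = [state]
    for _ in range(int(round(duration * STEPS_PER_TIME_UNIT))):
        u = controller(state) if controller else 0.0
        state = rk4_step(state, u, dt)
        trajectory.append(state)
    return trajectory
```

Calling `simulate((0.01, 0.0), 60.0)` integrates the open-loop system from a small perturbation. Under the assumed dynamics the state leaves the unstable origin but remains bounded.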

Two simple control laws, based on a linearization of the AEO equations of
motion, will serve as benchmarks for the learning controllers of §3, §4 and §5.

2.4.1  Linear  Dynamics 

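From (2.14a), the linear dynamics about the origin may be expressed by (2.16),
where A and B are given in (2.17).

\[ \dot{x}(\tau) = A x(\tau) + B u(\tau) \tag{2.16} \]

\[ A = \begin{bmatrix} 0 & 1 \\ -1 & nA_1U - 2\beta \end{bmatrix}, \qquad
   B = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \tag{2.17} \]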

This linearization may be derived by defining a set of perturbation variables,
x(τ) = x_0 + δx(τ) and u(τ) = u_0 + δu(τ), which must satisfy the differential equations.
Notice that δẋ(τ) = ẋ(τ). The expansion of δẋ(τ) in a Taylor series about (x_0, u_0)
yields (2.18).


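\[ \delta\dot{x}(\tau) = f\left[ x_0 + \delta x(\tau),\ u_0 + \delta u(\tau) \right]
   = f(x_0, u_0)
   + \left. \frac{\partial f}{\partial x} \right|_{x_0, u_0} \delta x(\tau)
   + \left. \frac{\partial f}{\partial u} \right|_{x_0, u_0} \delta u(\tau) + \cdots \tag{2.18} \]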


If the pair (x_0, u_0) represents an equilibrium of the dynamics, then f(x_0, u_0) =
0 by definition. Equation (2.16) is achieved by discarding the nonlinear terms of
(2.18) and applying (2.19), where A and B are the Jacobian matrices.

\[ A = \left. \frac{\partial f}{\partial x} \right|_{x_0, u_0}, \qquad
   B = \left. \frac{\partial f}{\partial u} \right|_{x_0, u_0} \tag{2.19} \]


2.4.2  The  Linear  Quadratic  Regulator 


The LQR solution minimizes a cost functional J that is an infinite time
horizon integral of a quadratic expression in state and control. The system dynamics
must be linear. The optimal control is given by (2.21).

\[ J = \int_0^\infty \left[ x^T(\tau)\, x(\tau) + u^2(\tau) \right] d\tau \tag{2.20} \]

\[ u(\tau) = -G^T x(\tau) \tag{2.21} \]

\[ G = \begin{bmatrix} 0.4142 \\ 3.0079 \end{bmatrix} \tag{2.22} \]

The actuators which apply the control force to the AEO are assumed to saturate at
±0.5 nondimensional force units. Therefore, the control law tested in this section
was written as

\[ u(\tau) = f\left( -\begin{bmatrix} 0.4142 & 3.0079 \end{bmatrix} x(\tau) \right) \tag{2.23} \]

\[ f(x) = \begin{cases} 0.5, & \text{if } x > 0.5 \\ -0.5, & \text{if } x < -0.5 \\
   x, & \text{otherwise.} \end{cases} \tag{2.24} \]

The state trajectory which resulted from applying the control law (2.23) to the
AEO, for the initial conditions {-1.0, 0.5}, appears in Figure 2.5. The controller
applied the maximum force until the state approached the origin, where the
dynamics are nearly linear (Figure 2.6). Therefore, the presence of the nonlinearity in
the dynamics did not strongly influence the performance of this control law.

If the linear dynamics were modeled perfectly (as above) and the magnitude
of the control were not limited, the LQR solution would perform extremely well.
Model uncertainty was introduced into the a priori model by designing the LQR
controller assuming the open-loop poles were 0.2 ± 1.8j.


\[ A' = \begin{bmatrix} 0 & 1 \\ -3.28 & 0.4 \end{bmatrix} \tag{2.26} \]


The LQR solution of (2.20) using A' is G = [0.1491, 1.6075]. This control law
applied to the AEO, when the magnitude of the applied force was limited at 0.5,
produced the results shown in Figures 2.7 and 2.8. The closed-loop system was
significantly under-damped.
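Both gain vectors quoted in this section can be reproduced with a short calculation, sketched here under stated assumptions. For dynamics of the form ẋ = [[0, 1], [-a0, a1]]x + [0, 1]ᵀu and a cost with unit weights on xᵀx and u², the algebraic Riccati equation reduces to two scalar quadratics. Both the unit weighting and the closed-form solution are assumptions of this sketch. The helper names are hypothetical.

```python
import math


def lqr_gain_companion(a0, a1):
    """LQR gain (g1, g2) for x' = [[0, 1], [-a0, a1]] x + [0, 1]^T u.

    Assumes the cost integral of x^T x + u^2. With P the Riccati solution,
    the gain is B^T P = (p12, p22), and the (1,1) and (2,2) entries of the
    Riccati equation give the two quadratics solved below.
    """
    g1 = -a0 + math.sqrt(a0 * a0 + 1.0)             # p12^2 + 2*a0*p12 - 1 = 0
    g2 = a1 + math.sqrt(a1 * a1 + 2.0 * g1 + 1.0)   # p22^2 - 2*a1*p22 - (2*p12 + 1) = 0
    return g1, g2


def saturate(value, limit=0.5):
    """Actuator limit of (2.24)."""
    return max(-limit, min(limit, value))


def limited_lqr_control(state, gain, limit=0.5):
    """Magnitude-limited state feedback in the form of (2.23)."""
    return saturate(-(gain[0] * state[0] + gain[1] * state[1]), limit)


if __name__ == "__main__":
    print(lqr_gain_companion(1.0, 1.2))    # nominal model, n*A1*U - 2*beta = 1.2
    print(lqr_gain_companion(3.28, 0.4))   # mismatched model with poles 0.2 +/- 1.8j
```

At the initial condition (-1.0, 0.5) the unsaturated nominal feedback is well below -0.5, so the limited law starts at the negative actuator limit.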

2.4.3  Bang-bang  Controller 

The bang-bang controller was restricted to two control actions, a maximum
positive force (0.5 nondimensional units) and a maximum negative force (-0.5);
this limitation will also be imposed on the initial direct learning experiments. The
control law is derived from the LQR solution and is non-optimal for the AEO system.
In the half of the state space where the LQR solution specifies a positive force, the
bang-bang control law (2.25) applies the maximum positive force. Similarly, in the
half of the state space where the LQR solution specifies a negative force, the
bang-bang control law applies the maximum negative force. The switching line which
divides the state space passes through the origin with slope -0.138; this is the line
of zero force in the LQR solution.

Figure 2.5. The AEO state trajectory achieved by a magnitude
limited LQR control law.

Figure 2.6. The LQR control history and the limited force which
yields Figure 2.5.


\[ u(\tau) = \begin{cases} 0.5, & \text{if } -G^T x(\tau) > 0 \\ -0.5, & \text{otherwise.} \end{cases} \tag{2.25} \]
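The two-action law and its switching line can be sketched directly from the LQR gain quoted in (2.22). The function names are illustrative. The slope follows from setting the unsaturated LQR force to zero, g1·x1 + g2·x2 = 0.

```python
GAIN = (0.4142, 3.0079)   # LQR gain quoted in (2.22)


def bang_bang_control(state, gain=GAIN, limit=0.5):
    """Apply the maximum force with the sign of the unsaturated LQR force."""
    lqr_force = -(gain[0] * state[0] + gain[1] * state[1])
    return limit if lqr_force > 0.0 else -limit


def switching_line_slope(gain=GAIN):
    """Slope dx2/dx1 of the zero-force line g1*x1 + g2*x2 = 0."""
    return -gain[0] / gain[1]


if __name__ == "__main__":
    print(bang_bang_control((-1.0, 0.5)), switching_line_slope())
```

Because the force depends only on which side of the line the state lies, a trajectory that reaches the line tends to chatter along it, which is the rapid alternation reported for this controller.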


The result of applying this bang-bang control law to the AEO with initial
conditions {-1.0, 0.5} appears in Figure 2.[...]. The trajectory initially traces the
trajectory in Figure 2.5 because the LQR solution was saturated at -0.5. However,
the trajectory slowly converges toward the origin along the line which divides the
positive and negative control regions, while rapidly alternating between exerting the
maximum positive force and maximum negative force (Figure 2.8). Generally, this
would represent unacceptable performance. The bang-bang control law represents
a two-action, linear control policy and will serve as a non-optimal benchmark with
which to compare the direct learning control laws. The optimal two-action control
law cannot be written from only a brief inspection of the nonlinear dynamics.




Chapter  3 

The  Associative  Control  Process 


The Associative Control Process (ACP) network [12,14] models certain
fundamental aspects of the animal nervous system, accounting for numerous classical and
instrumental conditioning phenomena.¹ The original ACP network was intended
to model limbic system, hypothalamic, and sensorimotor function as well as to
provide a general framework within which to relate animal learning psychology and
control theory. Through real-time, closed-loop, goal seeking interactions between
the learning system and the environment, the ACP algorithm can achieve solutions
to spatial and temporal credit assignment problems. This capability suggests that
the ACP algorithm, which accomplishes reinforcement or self-supervised learning,
may offer solutions to difficult optimal control problems.

¹ Animal learning phenomena are investigated through two classes of laboratory
conditioning procedures. Classical conditioning is an open-loop process in which the
experience of the animal is independent of the behavior of the animal. The
experience of the animal in closed-loop instrumental conditioning or operant conditioning
experiments is contingent on the animal's behavior [12].

This chapter constitutes a thorough description of the ACP network, viewed
from the perspective of applying the architecture and process as a controller for
dynamic systems.² A detailed description of the architecture and functionality
of the original ACP network (§3.1) serves as a foundation from which to describe
two levels of modification, intended to improve the applicability of the Associative
Control Process to optimal control problems. Initial modifications to the original
ACP specifications retain a two-layer network structure (§3.2); several difficulties
in this modified ACP motivate the development of a single layer architecture. A
single layer formulation of the ACP network abandons the biologically motivated
network structure while preserving the mathematical basis of the modified ACP
(§3.4). This minimal representation of an Associative Control Process performs
an incremental value-iteration procedure similar to Q learning and is guaranteed to
converge to the optimal policy in the infinite horizon optimal control problem under
certain conditions [13]. This chapter concludes with a summary of the application
of the modified and single layer ACP methods to the regulation of the aeroelastic
oscillator (§3.5 and §3.6).

3.1 The Original Associative Control Process

The definition of the original ACP is derived from Klopf [12], Klopf, Morgan,
and Weaver [14], as well as Baird and Klopf [13]. Although originally introduced in
the literature as a model to predict a variety of animal learning results from classical
and instrumental conditioning experiments, a recast version of the ACP network
has been shown to be capable of learning to optimally control any non-absorbing,
finite-state, finite-action, discrete time Markov decision process [13]. Although the
original form of the ACP may be incompatible with infinite time horizon optimal
control problems, as an introduction to the ACP derivatives, the original ACP
appears here with an accent toward applying the learning system to the optimal
control of dynamic systems. Where appropriate, analogies to animal learning
results motivate the presence of those features of the original ACP architecture which
emanate from a biological origin. Although the output and learning equations are
central in formalizing the ACP system, to eliminate ambiguities concerning the
interconnection and functionality of network elements, substantial textual description
of rules is required.

² This context is in contrast to the perspective that an ACP network models aspects
of biological systems.

The ACP network consists of five distinct elements: acquired drive sensors,
motor centers, reinforcement centers, primary drive sensors, and effectors (Figure
3.1). In the classical conditioning nomenclature, the acquired drive sensors represent
the conditioned stimuli; in the context of a control problem, the acquired drive
sensors encode the sensor measurements and will be used to identify the discrete
dynamic state. The ACP requires an interface with the environment that contains a
finite set of states. Therefore, for the application of the ACP to a control problem,
the state space of a dynamic system is quantized into a set of m disjoint,
non-uniform bins which fill the entire state space.³ The ACP learning system operates
in discrete time. At any stage in discrete time, the state of the dynamic system
will lie within exactly one bin, with which a single acquired drive sensor is uniquely
associated. The current output of the i-th acquired drive sensor, x_i(k), will be either
unity or zero, and exactly one acquired drive sensor will have unity output at each
time step.⁴ The vector of m acquired drive signals, x(k), should not be confused
with the vector of state variables, the length of which equals the dimension of the
state space.

³ A sufficient condition is for the bins to fill the entire operational envelope, i.e. the
region of the state space that the state may enter.

Figure 3.1. The ACP network architecture.
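The mapping from a continuous state to the acquired drive signals can be sketched as follows. The bin boundaries and function names are hypothetical. The only property taken from the text is that exactly one of the m signals equals unity at each time step.

```python
import bisect


def bin_index(state, edges_per_variable):
    """Index of the unique bin containing `state`.

    edges_per_variable[d] is a sorted list of interior boundaries for state
    variable d; the boundaries may be non-uniform and the outer bins are unbounded.
    """
    index = 0
    for value, edges in zip(state, edges_per_variable):
        index = index * (len(edges) + 1) + bisect.bisect_right(edges, value)
    return index


def acquired_drive_signals(state, edges_per_variable):
    """Vector x(k) of m zeros and a single one marking the occupied bin."""
    m = 1
    for edges in edges_per_variable:
        m *= len(edges) + 1
    signals = [0] * m
    signals[bin_index(state, edges_per_variable)] = 1
    return signals


if __name__ == "__main__":
    # Illustrative boundaries: 5 position bins x 3 velocity bins = 15 sensors.
    edges = [[-0.6, -0.1, 0.1, 0.6], [-0.3, 0.3]]
    print(acquired_drive_signals((-1.0, 0.5), edges))
```

The vector returned has length m = 15 in this example, while the state itself has only two components, which illustrates the distinction drawn above.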

A motor center and effector pair exists for each discrete network output.⁵ The
motor centers collectively determine the network's immediate action and, therefore,
the set of n motor centers operate as a single policy center. In animal learning
research, the effector encodes an action which the animal may choose to perform
(e.g. to turn left). As a component of a control system, each effector represents
a discrete control produced by an actuator (e.g. apply a force of 10.0 units). The
output of a motor center is a real number and should not be confused with the
output of the ACP network, which is an action performed by an effector.

The output of the j-th motor center, y_j(k), equals the evaluation of a
nonlinear, threshold-saturation function (Figure 3.2) applied to the weighted sum of the
acquired drive sensor inputs.


(3.1) 


\[ f_n(x) = \begin{cases} 0 & \text{if } x < \theta \\ 1 & \text{if } x > 1 \\ x & \text{otherwise} \end{cases} \tag{3.2} \]

⁴ This condition is not necessary in the application of the ACP to predict animal
learning results.

⁵ Recall that the ACP network output must be a member of a finite set of control
actions.

Figure 3.2. The output equation nonlinearity, (3.2).

The threshold θ is a non-negative constant less than unity. Justification for the
presence of the output nonlinearity follows directly from the view that a neuronal
output measures the frequency of firing of the neuron, when that frequency exceeds
the neuronal threshold.⁶ Negative values of y_j(k), representing negative frequencies
of firing, are not physically realizable.

The motor center output equation (3.1) introduces two weights from each
acquired drive sensor to each motor center: a positive excitatory weight $W^{+}_{ij}(k)$
and a negative inhibitory weight $W^{-}_{ij}(k)$. Biological evidence motivates the presence
of distinct excitatory and inhibitory weights that encode attraction and avoidance
behaviors, respectively, for each state-action pair. The time dependence of the
weights is explicitly shown to indicate that the weights change with time through
learning; the notation does not imply that functions of time are determined for each
weight.

⁶ The term neuronal output refers to the output of a motor center or a reinforcement
center.

Reciprocal inhibition, the process of comparing several neuronal outputs and
suppressing all except the largest to zero, prevents the motor centers that are not
responsible for the current action from undergoing weight changes. Reciprocal
inhibition is defined by (3.3). The motor center $j_{max}(k)$ which wins reciprocal inhibition
among the n motor center outputs at time k will be referred to as the currently
active motor center; $j_{max}(k-a)$, therefore, is the motor center that was active a
time steps prior to the present, and $y_{j_{max}(k-a)}(k)$ is the current output of the motor
center that was active a time steps prior to the present.

\[ j_{max}(k) = j \ \text{ such that for all } l \in \{1, 2, \ldots, n\} \text{ and } l \ne j, \quad
   y_l(k) < y_j(k) \tag{3.3} \]

The current network action corresponds to the effector associated with the
single motor center which has a non-zero output after reciprocal inhibition.
Potentially, multiple motor centers may have equally large outputs. In this case, reciprocal
inhibition for the original ACP is defined such that no motor center will be active,
no control action will be effected, and no learning will occur.
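The output nonlinearity (3.2) and the winner selection of (3.3) can be sketched together. The sketch assumes that the weighted sum in (3.1) adds the excitatory and inhibitory weights of each sensor before applying f_n; that form, the treatment of an all-zero output vector as "no active center", and all names below are assumptions.

```python
def threshold_saturation(x, theta):
    """f_n of (3.2): zero below the threshold, clipped at one, linear between."""
    if x < theta:
        return 0.0
    if x > 1.0:
        return 1.0
    return x


def motor_center_outputs(x, w_exc, w_inh, theta):
    """Outputs y_j(k) for n motor centers.

    x is the list of m acquired drive signals; w_exc[i][j] > 0 and
    w_inh[i][j] < 0 are the weights from sensor i to motor center j.
    """
    n = len(w_exc[0])
    return [
        threshold_saturation(
            sum((w_exc[i][j] + w_inh[i][j]) * x[i] for i in range(len(x))), theta)
        for j in range(n)
    ]


def reciprocal_inhibition(outputs):
    """Index of the strictly largest non-zero output, or None if there is no unique winner."""
    best = max(outputs)
    winners = [j for j, y in enumerate(outputs) if y == best]
    if best <= 0.0 or len(winners) != 1:
        return None   # tie or silent network: no action and no learning
    return winners[0]


if __name__ == "__main__":
    x = [0, 1, 0]
    w_exc = [[0.1, 0.1], [0.6, 0.3], [0.1, 0.1]]
    w_inh = [[-0.1, -0.1], [-0.1, -0.1], [-0.1, -0.1]]
    y = motor_center_outputs(x, w_exc, w_inh, theta=0.0)
    print(y, reciprocal_inhibition(y))
```

Returning `None` for ties mirrors the rule stated above that equally large outputs leave the network without an active motor center.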

The ACP architecture contains two primary drive sensors, differentiated by the
labels positive and negative. The primary drive sensors provide external evaluations
of the network's performance in the form of non-negative reinforcement signals;
the positive primary drive sensor measures reward while the negative primary drive
sensor measures cost or punishment. In the language of classical conditioning, these
evaluations are collectively labeled the unconditioned stimuli. In the optimal control
framework, the reward equals zero and the punishment represents an evaluation of
the cost functional which the control is attempting to minimize.

The ACP architecture also contains two reinforcement centers which are
identified as positive and negative and which yield non-negative outputs. Each
reinforcement center learns to predict the occurrence of the corresponding external
reinforcement and consequently serves as a source of internal reinforcement,
allowing learning to continue in the absence of frequent external reinforcement. In this
way, the two reinforcement centers direct the motor centers, through learning, to
select actions such that the state approaches reward and avoids cost.

Each motor center facilitates a pair of excitatory and inhibitory weights from
each acquired drive sensor to each reinforcement center. The output of the positive
reinforcement center, prior to reciprocal inhibition between the two reinforcement
centers, is the sum of the positive external reinforcement $r_P(k)$ and the weighted
sum of the acquired drive sensor inputs. The appropriate set of weights from the
acquired drive sensors to the reinforcement center corresponds to the currently
active motor center. Therefore, calculation of the outputs of the reinforcement
centers requires prior determination of $j_{max}(k)$.

\[ y_P(k) = f_n\left[ r_P(k) + \sum_{i=1}^{m} \left( W^{+}_{P i j_{max}(k)}(k)
   + W^{-}_{P i j_{max}(k)}(k) \right) x_i(k) \right] \tag{3.4} \]


The output of the negative reinforcement center $y_N(k)$ is calculated similarly, using
the negative external reinforcement $r_N(k)$.

\[ y_N(k) = f_n\left[ r_N(k) + \sum_{i=1}^{m} \left( W^{+}_{N i j_{max}(k)}(k)
   + W^{-}_{N i j_{max}(k)}(k) \right) x_i(k) \right] \tag{3.5} \]


The ACP learning mechanism improves the stored policy and the predictions
of future reinforcements by adjusting the weights which connect the acquired drive
sensors to the motor and reinforcement centers. If the j-th motor center is active
with the i-th acquired drive sensor, then the reinforcement center weights $W_{Pij}(k)$
and $W_{Nij}(k)$ are eligible to change for τ subsequent time steps. The motor center
weights $W_{ij}(k)$ are eligible to change only during the current time step.⁷ Moreover,
all weights for other state-action pairs will remain constant this time step.

The impetus for motor center learning is the difference, after reciprocal inhibition, between the outputs of the positive and negative reinforcement centers. The following equations define the incremental changes in the motor center weights, where the constants c_a and c_b are non-negative. The nonlinear function f_s, in (3.6), defined by (3.9), requires that only positive changes in presynaptic activity, Δx_i(k), stimulate weight changes.


\[
\Delta W^{\pm}_{ij}(k) =
\begin{cases}
c(k)\, \left| W^{\pm}_{ij}(k) \right| f_s\!\left( \Delta x_i(k) \right) \left[ y_P(k) - y_N(k) - y_j(k) \right] & \text{if } j = j_{\max}(k) \\
0 & \text{otherwise}
\end{cases} \tag{3.6}
\]

\[
c(k) = c_a + c_b \left| y_P(k) - y_N(k) \right| \tag{3.7}
\]

\[
\Delta x_i(k) = x_i(k) - x_i(k-1) \tag{3.8}
\]

\[
f_s(x) =
\begin{cases}
x & \text{if } x > 0 \\
0 & \text{otherwise}
\end{cases} \tag{3.9}
\]

The weights of both positive and negative reinforcement centers are eligible for change even though both reinforcement centers cannot win reciprocal inhibition. In contrast, only the motor center that wins reciprocal inhibition can experience weight changes. If no motor center is currently active, however, no learning occurs in either the motor centers or the reinforcement centers.

The learning process is divided into temporal intervals referred to as trials; the weight changes, which are calculated at each time step, are accumulated throughout the trial and implemented at the end of the trial. The symbols k_0 and k_f in (3.10) represent the times before and after a trial, respectively. A lower bound on the magnitude of every weight maintains each excitatory weight always positive and each inhibitory weight always negative (Figures 3.3 and 3.4). The constant a in (3.11) is a positive network parameter.


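\[
W^{+}_{ij}(k_f) = f_{w+}\!\left[ W^{+}_{ij}(k_0) + \sum_{k=k_0}^{k_f} \Delta W^{+}_{ij}(k) \right] \tag{3.10a}
\]

\[
W^{-}_{ij}(k_f) = f_{w-}\!\left[ W^{-}_{ij}(k_0) + \sum_{k=k_0}^{k_f} \Delta W^{-}_{ij}(k) \right] \tag{3.10b}
\]

\[
f_{w+}(x) =
\begin{cases}
a & \text{if } x < a \\
x & \text{otherwise}
\end{cases} \tag{3.11a}
\]

\[
f_{w-}(x) =
\begin{cases}
-a & \text{if } x > -a \\
x & \text{otherwise}
\end{cases} \tag{3.11b}
\]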
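The bounding functions (3.11) and the end-of-trial update (3.10) can be illustrated with a short sketch. The function and argument names are assumptions introduced here for illustration, and the numerical values are hypothetical apart from a = 0.1, which is the value listed later in Table 3.1.

```python
def f_w_plus(x: float, a: float) -> float:
    """Lower bound on an excitatory weight, as in (3.11a)."""
    return a if x < a else x


def f_w_minus(x: float, a: float) -> float:
    """Upper bound on an inhibitory weight, as in (3.11b)."""
    return -a if x > -a else x


def end_of_trial_update(w0: float, increments: list, a: float, excitatory: bool) -> float:
    """Accumulate the increments of one trial, then bound once, as in (3.10)."""
    total = w0 + sum(increments)
    return f_w_plus(total, a) if excitatory else f_w_minus(total, a)


# Hypothetical values: an excitatory weight pushed below the bound is clamped at a.
clamped = end_of_trial_update(0.3, [-0.1, -0.3], a=0.1, excitatory=True)
# An inhibitory weight that stays below -a is left unchanged by the bound.
unclamped = end_of_trial_update(-0.5, [-0.25], a=0.1, excitatory=False)
```

Because the bound is applied only once per trial in (3.10), individual increments may temporarily carry the running sum across the bound without being clipped.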


Equations (3.12) through (3.15) define the Drive-Reinforcement (DR) learning mechanism used in the positive reinforcement center; negative reinforcement center learning follows directly [12,14]. Drive-Reinforcement learning, which is a flavor of temporal difference learning [15], changes eligible connection weights as a function of the correlation between earlier changes in input signals and later changes in output signals. The constants τ (which in animal learning represents the longest interstimulus interval over which delay conditioning is effective) and c_1, c_2, ..., c_τ are non-negative. Whereas τ may be experimentally deduced for animal learning problems, selection of an appropriate value of τ in a control problem typically requires experimentation with the particular application. The incremental change in the weight associated with a reinforcement center connection depends on four terms. The correlation between the current change in postsynaptic activity, Δy_P(k), and a previous change in presynaptic activity, Δx_ij(k - a), is scaled by a learning rate constant c_a and the absolute value of the weight of the connection at the time of the change in presynaptic activity.

Figure 3.3. The lower bound on excitatory weights, f_w+(x), (3.11a).

Figure 3.4. The upper bound on inhibitory weights, f_w-(x), (3.11b).


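\[
\Delta W^{\pm}_{Pij}(k) = \Delta y_P(k) \sum_{a=1}^{\tau} c_a \left| W^{\pm}_{Pij}(k-a) \right| f_s\!\left( \Delta x_{ij}(k-a) \right) \tag{3.12}
\]

\[
\Delta y_P(k) = y_P(k) - y_P(k-1) \tag{3.13}
\]

\[
\Delta x_{ij}(k-a) =
\begin{cases}
x_i(k-a) - x_i(k-a-1) & \text{if } j = j_{\max}(k-a) \\
0 & \text{otherwise}
\end{cases} \tag{3.14}
\]

\[
W^{+}_{Pij}(k_f) = f_{w+}\!\left[ W^{+}_{Pij}(k_0) + \sum_{k=k_0}^{k_f} \Delta W^{+}_{Pij}(k) \right] \tag{3.15a}
\]

\[
W^{-}_{Pij}(k_f) = f_{w-}\!\left[ W^{-}_{Pij}(k_0) + \sum_{k=k_0}^{k_f} \Delta W^{-}_{Pij}(k) \right] \tag{3.15b}
\]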


Note that the accumulation of weight changes until the completion of a trial eliminates the significance of the time shift in the weight term |W_Pij(k - a)| in (3.12), because the weights remain constant throughout a trial.
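A minimal sketch of the reinforcement center increment (3.12) through (3.14) for a single connection is given below. The history containers, the function name dr_increment, and all numerical values are assumptions made for illustration; the sketch stores one scalar weight per time step for the connection of interest.

```python
def f_s(x):
    """Pass only positive changes in presynaptic activity, as in (3.9)."""
    return x if x > 0 else 0.0


def dr_increment(y_hist, x_hist, j_hist, w_hist, c, i, j):
    """Increment (3.12) for the weight from sensor i in the weight set of motor center j.

    y_hist[k]: reinforcement center output at step k
    x_hist[k]: list of sensor outputs at step k
    j_hist[k]: index of the motor center active at step k
    w_hist[k]: value of the weight at step k
    c:         learning rates, c[a] for a = 1..tau (c[0] is unused)
    """
    k = len(y_hist) - 1
    tau = len(c) - 1
    dy = y_hist[k] - y_hist[k - 1]                     # (3.13)
    total = 0.0
    for a in range(1, tau + 1):
        if k - a - 1 < 0:                              # not enough history yet
            break
        if j_hist[k - a] != j:                         # (3.14): zero unless j was active
            continue
        dx = x_hist[k - a][i] - x_hist[k - a - 1][i]   # (3.8)
        total += c[a] * abs(w_hist[k - a]) * f_s(dx)
    return dy * total


# Hypothetical four-step history with two sensors and tau = 2.
y_hist = [0.0, 0.0, 0.2, 0.6]
x_hist = [[0, 0], [1, 0], [0, 1], [0, 1]]
j_hist = [0, 0, 0, 0]
c = [0.0, 0.5, 0.25]
dw_sensor0 = dr_increment(y_hist, x_hist, j_hist, [1.0] * 4, c, i=0, j=0)
dw_sensor1 = dr_increment(y_hist, x_hist, j_hist, [2.0] * 4, c, i=1, j=0)
dw_other_center = dr_increment(y_hist, x_hist, j_hist, [1.0] * 4, c, i=0, j=1)
```

In this example sensor 0 turned on two steps before the change in output and sensor 1 one step before, so the two increments are weighted by c_2 and c_1 respectively, while the weight set of the inactive motor center receives no increment.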

The credit assignment problem refers to the situation that some judicious choice of action at the present time may yield little or no immediate return, relative to other possible actions, but may allow maximization of future returns.

The term return denotes a single reinforcement signal that equals the reward minus the cost. In an environment that measures simultaneous non-zero reward and cost signals, a controller should maximize the return.

The assessment of responsibility among the recent actions for the current return is accomplished through the summation over the previous τ time steps in the reinforcement center learning equation (3.12). In the negative reinforcement center, for example, a correlation is achieved between Δy_N(k) and the previous τ state transitions. This process of relating the current Δy_N to the previous Δx's is referred to as chaining in animal learning. The learning rate coefficients discount the responsibility of previous actions for the current change in predicted return, where the reinforcement center outputs are predictions of future costs and rewards. Biological evidence suggests that no correlation exists between a simultaneous action and a change in predicted return, i.e. c_0 = 0, and 1 > c_j > c_{j+1} for 1 ≤ j < τ.

3.2 Extension of the ACP to the Infinite Horizon, Optimal Control Problem


Limited modifications to the architecture and functionality of the original Associative Control Process result in a network with improved applicability to optimal control problems. Although Baird and Klopf [13] have suggested that this modified ACP will converge to the optimal control policy under reasonable assumptions, the analysis in §3.3 and the results in §3.6 suggest that the necessary conditions to obtain an optimal solution may be restrictive. This section is included to follow the development of the ACP and to motivate the single layer ACP architecture. The definition of the modified ACP follows from Baird and Klopf [13].

The modified ACP is applicable to a specialized class of problems; the environment with which the ACP interacts must be a non-absorbing, finite-state, finite-action, discrete-time Markov decision process. Additionally, the interface between the ACP and the environment guarantees that no acquired drive sensor will exhibit unity output for more than a single consecutive time step. This stipulation results in non-uniform time steps that are artificially defined as the intervals which elapse while the dynamic state resides within a bin. The learning equations of the original ACP can be simplified by applying the fact that x_i(k) ∈ {1, 0} and will not equal unity for two or more consecutive time steps. Accordingly, (3.8) and (3.9) yield,

\[
f_s\!\left( \Delta x_i(k) \right) =
\begin{cases}
1 & \text{if } x_i(k) = 1 \\
0 & \text{otherwise.}
\end{cases} \tag{3.16}
\]

Therefore, a consequence of the interface between the ACP and the environment is f_s(Δx_i(k)) = x_i(k). A similar result follows from (3.9) and (3.14).

\[
f_s\!\left( \Delta x_{ij}(k-a) \right) =
\begin{cases}
1 & \text{if } x_i(k-a) = 1 \text{ and } j = j_{\max}(k-a) \\
0 & \text{otherwise}
\end{cases} \tag{3.17}
\]
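The simplification above can be checked numerically. The following sketch, in which the function names and the binary sequences are hypothetical, confirms that f_s(Δx_i(k)) equals x_i(k) whenever a binary sensor is never active on two consecutive steps, and shows that the identity fails when that stipulation is violated.

```python
def f_s(x):
    """Pass only positive changes, as in (3.9)."""
    return x if x > 0 else 0


def identity_holds(x_seq):
    """True if f_s(x(k) - x(k-1)) == x(k) at every step k >= 1 of a binary sequence."""
    return all(f_s(x_seq[k] - x_seq[k - 1]) == x_seq[k] for k in range(1, len(x_seq)))


# Sensor never equals one on two consecutive steps: the identity holds.
holds = identity_holds([0, 1, 0, 0, 1, 0])
# Sensor stays at one for two steps: the change is zero while x(k) = 1.
fails = identity_holds([0, 1, 1, 0])
```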


Similar to §3.1, the state space is quantized into bins.

The role of the reinforcement center weights becomes more well defined in the modified ACP. The sum of the inhibitory and excitatory weights in a reinforcement center estimates the expected discounted future reinforcement received if action j is performed in state i, followed by optimal actions being performed in all subsequent states. To achieve this significance, the reinforcement center output and learning equations must be recast. The external reinforcement term does not appear in the output equation of the reinforcement center; e.g. (3.4) becomes,

\[
y_P(k) = f_n\!\left[ \sum_{i=1}^{m} \left( W^{+}_{P i j_{\max}(k)}(k) + W^{-}_{P i j_{\max}(k)}(k) \right) x_i(k) \right] \tag{3.18}
\]

The expression for the change in the reinforcement center output is also slightly modified. Using the example of the negative reinforcement center, (3.13) becomes,

\[
\Delta y_N(k) = \gamma\, y_N(k) - y_N(k-1) + r_N(k) \quad \text{where } 0 < \gamma < 1. \tag{3.19}
\]

If the negative reinforcement center accurately estimates the expected discounted future cost, Δy_N(k) will be zero and no weight changes will occur. Therefore, the cost to complete the problem from time k - 1 will approximately equal the cost accrued from time k - 1 to k plus the cost to complete the problem from time k.

\[
y_N(k-1) = \gamma\, y_N(k) + r_N(k) \quad \text{when } \Delta y_N(k) = 0 \tag{3.20}
\]

The value of r_N(k), therefore, represents the increment in the cost functional ΔJ from time k - 1 to k. Recall that time steps are an artificially defined concept in the modified ACP; the cost increment must be an assessment of the cost functional over the real elapsed time. The possibility that an action selected now does not significantly affect the cost in the far future is described by the discount factor γ, which also guarantees the convergence of the infinite horizon sum of discounted future costs.
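The fixed point expressed by (3.19) and (3.20) can be illustrated with a small numerical sketch. The cost sequence, the value of γ, and the function names below are hypothetical; the sketch computes exact discounted costs for a finite sequence and confirms that the change in output is zero when the estimates are exact.

```python
def delta_y(y_now, y_prev, r_now, gamma):
    """Change in reinforcement center output, as in (3.19)."""
    return gamma * y_now - y_prev + r_now


def discounted_sum(costs, gamma):
    """Discounted sum costs[0] + gamma * costs[1] + gamma**2 * costs[2] + ..."""
    total = 0.0
    for cost in reversed(costs):
        total = cost + gamma * total
    return total


gamma = 0.5
costs = [1.0, 2.0, 4.0]                 # hypothetical r_N(1), r_N(2), r_N(3)
y0 = discounted_sum(costs, gamma)       # exact discounted cost seen from time 0
y1 = discounted_sum(costs[1:], gamma)   # exact discounted cost seen from time 1
error = delta_y(y1, y0, costs[0], gamma)
```

For a bounded cost increment of at most r_max per step and γ < 1, the geometric series bounds the infinite horizon sum by r_max / (1 - γ), which is the convergence property attributed to the discount factor above.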

This statement is strictly true for γ = 1.

Time is discrete in this system. Time steps will coincide with an integral number of discrete time increments.

The constants in (3.7) are defined as follows: c_a = 1 and c_b = 0. Additionally, the terms which involve the absolute values of the weights are removed from both the motor center learning equation and the reinforcement center learning equation. Equations (3.6) and (3.12) are written as (3.21) and (3.22), respectively. With the absence of these terms, the distinct excitatory and inhibitory weights could be combined into a single weight, which can assume positive or negative values. This change, however, is not made in [13].

\[
\Delta W^{\pm}_{ij}(k) =
\begin{cases}
c(k)\, f_s\!\left( \Delta x_i(k) \right) \left[ y_P(k) - y_N(k) - y_j(k) \right] & \text{if } j = j_{\max}(k) \\
0 & \text{otherwise}
\end{cases} \tag{3.21}
\]

\[
\Delta W^{\pm}_{Pij}(k) = \Delta y_P(k) \sum_{a=1}^{\tau} c_a\, f_s\!\left( \Delta x_{ij}(k-a) \right) \tag{3.22}
\]

The motor center learning equation (3.21) causes the motor center weights to be adjusted so that W+_ij(k) + W-_ij(k) will copy the corresponding sum of weights for the reinforcement center that wins reciprocal inhibition. The saturation limits on the motor center outputs are generalized; in contrast to (3.2), f_n(x) is redefined as f_n'(x).

\[
f_n'(x) =
\begin{cases}
-\beta & \text{if } x \le -\beta \\
\beta & \text{if } x \ge \beta \\
x & \text{otherwise}
\end{cases} \tag{3.23}
\]

Additionally, the definition of reciprocal inhibition is adjusted slightly; the non-maximizing motor center outputs are suppressed to a minimum value -β which is not necessarily zero.

Although the learning process is still divided into trials, the weight increments are incorporated into the weights at every time step, instead of after a trial has been completed. Equations (3.10) and (3.15) are now written as (3.24) and (3.25), respectively.

\[
W^{+}_{ij}(k) = f_{w+}\!\left[ W^{+}_{ij}(k-1) + \Delta W^{+}_{ij}(k) \right] \tag{3.24a}
\]

\[
W^{-}_{ij}(k) = f_{w-}\!\left[ W^{-}_{ij}(k-1) + \Delta W^{-}_{ij}(k) \right] \tag{3.24b}
\]

\[
W^{+}_{Pij}(k) = f_{w+}\!\left[ W^{+}_{Pij}(k-1) + \Delta W^{+}_{Pij}(k) \right] \tag{3.25a}
\]

\[
W^{-}_{Pij}(k) = f_{w-}\!\left[ W^{-}_{Pij}(k-1) + \Delta W^{-}_{Pij}(k) \right] \tag{3.25b}
\]

A procedural issue arises that is not encountered in the original ACP network, where the weights are only updated at the end of a trial. The dependence of the reinforcement center outputs on j_max(k) requires that the motor center outputs be computed first. After learning, however, the motor center outputs and also j_max(k) may have changed, resulting in the facilitation of a different set of reinforcement center weights. Therefore, if weight changes are calculated such that j_max(k) changes, these weight changes should be implemented and the learning process repeated until j_max(k) does not further change this time step.
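The repetition described above amounts to a fixed-point loop within a single time step. The sketch below is a schematic under stated assumptions: select_j and learn stand in for reciprocal inhibition and for the weight updates, the toy outputs list is hypothetical, and the max_iters safeguard is an addition that is not part of the procedure as described.

```python
def settle_active_center(select_j, learn, max_iters=100):
    """Repeat learning until the index of the winning motor center stops changing."""
    j = select_j()
    for _ in range(max_iters):      # safeguard against endless alternation (assumption)
        learn(j)                    # implement the weight changes for the active center
        j_new = select_j()
        if j_new == j:
            break
        j = j_new
    return j


# Toy demonstration: the first learning pass lowers the winning output enough
# that a different center wins, so learning is repeated once more.
outputs = [1.0, 0.9]
calls = []


def select_j():
    return max(range(len(outputs)), key=lambda idx: outputs[idx])


def learn(j):
    calls.append(j)
    if len(calls) == 1:
        outputs[j] -= 0.2


settled = settle_active_center(select_j, learn)
```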

In general, exploration of the state-action space is necessary to assure global convergence of the control policy to the optimal policy, and can be achieved by occasionally randomly selecting j_max(k), instead of following reciprocal inhibition. Initiating new trials in random states also provides exploratory information.
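One simple way to realize occasional random selection is sketched below. The exploration probability explore_prob and the function name are assumptions; the text does not specify how often random selection occurs.

```python
import random


def select_action(outputs, explore_prob, rng):
    """Return a random index with probability explore_prob, else the maximizing index."""
    if rng.random() < explore_prob:
        return rng.randrange(len(outputs))
    # max() returns the first maximal index, so ties resolve to the smallest index.
    return max(range(len(outputs)), key=lambda j: outputs[j])


rng = random.Random(0)
greedy_choice = select_action([0.1, 0.9, 0.3], 0.0, rng)    # never explores
random_choices = {select_action([0.1, 0.9, 0.3], 1.0, rng) for _ in range(200)}
```

With explore_prob set to zero the selection reduces to reciprocal inhibition alone; with a nonzero value every action in a visited state is eventually tried.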


3.3  Motivation  for  the  Single  Layer  Architecture  of  the  ACP 


This section describes qualitative observations from the application of the modified two-layer ACP to the regulation of the aeroelastic oscillator; additional quantitative results appear in §3.6. In this environment, the modified ACP learning system fails to converge to a useful control policy. This section explains the failure by illustrating several characteristics of the two-layer implementation of the ACP algorithm that are incompatible with the application to optimal control problems.

The objective of a reinforcement learning controller is to construct a policy that, when followed, maximizes the expectation of the discounted future return. For the two-layer ACP network, the incremental return is presented as distinct cost and reward signals, which stimulate the two reinforcement centers to learn estimates of the expected discounted future cost and expected discounted future reward. The optimal policy for this ACP algorithm is to select, for each state, the action with the largest difference between estimates of expected discounted future reward and cost. However, the two-layer ACP network performs reciprocal inhibition between the two reinforcement centers and, therefore, selects the control action that either maximizes the estimate of the expected discounted future reward, or minimizes the estimate of the expected discounted future cost, depending on which reinforcement center wins reciprocal inhibition. Consider a particular state-action pair evaluated with both a large cost and a large reward. If the reward is slightly greater than the cost, only the large reward will be associated with this state-action pair. Although the true evaluation of this state-action pair is a small positive return, this action in this state may be incorrectly selected as optimal.
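The mis-selection described above can be made concrete with a deliberately simplified model. The numbers, the action labels, and the function copied_value are hypothetical; the model assumes only that the estimate of the winning reinforcement center is what reaches the motor centers.

```python
def copied_value(reward, cost):
    """Value passed on when only the winning reinforcement center is copied."""
    return reward if reward >= cost else -cost


# Hypothetical (reward, cost) estimates for two actions in one state.
estimates = {"A": (10.0, 9.5), "B": (2.0, 0.1)}

# Selection when only the winning center's estimate is compared across actions.
by_winner = max(estimates, key=lambda act: copied_value(*estimates[act]))
# Selection based on the return, i.e. reward minus cost.
by_return = max(estimates, key=lambda act: estimates[act][0] - estimates[act][1])
```

Action A has a return of only 0.5 but is preferred when the large reward alone is compared, whereas action B, with a return of 1.9, is the better choice.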

The reinforcement center learning mechanism incorporates both the current and the previous outputs of the reinforcement center. For example, the positive reinforcement center learning equation includes the term Δy_P(k), given in (3.26), which represents the error in the estimate of the expected discounted future reward for the previous state y_P(k - 1).

\[
\Delta y_P(k) = \gamma\, y_P(k) - y_P(k-1) + r_P(k) \tag{3.26}
\]

A reinforcement center that loses the reciprocal inhibition process will have an output equal to zero. Consequently, the value of Δy_P(k) will not accurately represent the error in y_P(k - 1) when y_P(k) or y_P(k - 1) equals zero as a result of reciprocal inhibition. Therefore, Δy_P(k) will be an invalid contribution to reinforcement learning if the positive and negative reinforcement centers alternate winning reciprocal inhibition. Similarly, Δy_N(k) may be erroneous by a parallel argument. Moreover, the fact that learning occurs even for the reinforcement center which loses reciprocal inhibition assures that either Δy_P(k) or Δy_N(k) will be incorrect on every time step that a motor center is active. If no motor center is active, no set of weights between the acquired drive sensors and reinforcement centers is facilitated and both reinforcement centers will have zero outputs. Although no learning occurs in the reinforcement centers on this time step, both Δy_P and Δy_N will be incorrect on the next time step that a motor center is active.

The difficulties discussed above, which arise from the presence of two competing reinforcement centers, are reduced by providing a non-zero external reinforcement signal to only a single reinforcement center. However, the reinforcement center which receives zero external reinforcement will occasionally win reciprocal inhibition until it learns that zero is the correct output for every state. Using the sum of the reinforcement center output and the external reinforcement signal as the input to the reciprocal inhibition process may guarantee that a single reinforcement center will always win reciprocal inhibition.

The original ACP uses this technique in (3.4) and (3.5); the modified two-layer ACP eliminates the external reinforcement signal from the reinforcement center output in (3.18).

The optimal policy for each state is defined by the action which yields the largest expected discounted future return. The ACP network represents this information in the reinforcement centers and, through learning, transfers the value estimates to the motor centers, where an action is selected through reciprocal inhibition. The motor center learning mechanism copies either the estimate of expected discounted future cost or the estimate of expected discounted future reward, depending on which reinforcement center wins reciprocal inhibition, into the single currently active motor center for a given state. Potentially, each time this state is visited, a different reinforcement center will win reciprocal inhibition and a different motor center will be active. Therefore, at a future point in time, when this state is revisited, reciprocal inhibition between the motor center outputs may compare estimates of expected discounted future cost with estimates of expected discounted future reward. This situation, also generated when the two reinforcement centers alternate winning reciprocal inhibition, invalidates the result of reciprocal inhibition between motor centers. Therefore, the ACP algorithm to select a policy does not guarantee that a complete set of estimates of a consistent evaluation (i.e. reward, cost, or return) will be compared over all possible actions.

This section has introduced several fundamental limitations in the two-layer implementation of the ACP algorithm, which restrict its applicability to optimal control problems. By reducing the network to a single layer of learning centers, the resulting architecture does not interfere with the operation of the Drive-Reinforcement concept to solve infinite-horizon optimization problems.

3.4 A Single Layer Formulation of the Associative Control Process


The starting point for this research was the original Associative Control Process. However, several elements present in the original ACP network, which are consistent with the known physiology of biological neurons, are neither appropriate nor necessary in a network solely intended as an optimal controller. This section presents a single layer formulation of the modified ACP (Figure 3.5), which contains significantly fewer adjustable parameters, fewer element types, and no nonlinearity in the output equation. Although the physical structure of the single layer network is not faithful to biological evidence, the network retains the ability to predict classical and instrumental conditioning results [13].

The  interface  of  the  environment  to  the  single  layer  network  through  m  input 
sensors  is  identical  to  the  interface  to  the  modified  ACP  network  through  the  ac¬ 
quired  drive  sensors.  A  single  external  reinforcement  signal  r(k),  which  assesses  the 
incremental  return  achieved  by  the  controller’s  actions,  replaces  the  distinct  reward 
and  cost  external  reinforcement  signals  present  in  the  two-layer  network. 

A node and effector pair exists for each discrete network action. The output y_j(k) of the jth node estimates the expected discounted future return for performing action j in the current state and subsequently following an optimal policy. The sum of an excitatory and an inhibitory weight encodes this estimate. Constructed from a single type of neuronal element, the single layer ACP architecture requires only a single linear output equation and a single learning equation.

\[
y_j(k) = \sum_{i=1}^{m} \left( W^{+}_{ij}(k) + W^{-}_{ij}(k) \right) x_i(k) \tag{3.27}
\]

A node combines the functionality of the motor and reinforcement centers.

Figure 3.5. The single layer ACP architecture. (Labels: sensor inputs, action nodes, effectors.)


The optimal policy, to maximize the expected discounted future return, selects for each state the action corresponding to the node with greatest output. Reciprocal inhibition between the n nodes defines a currently active node j_max(k), similar to the process between motor centers in the two-layer ACP. However, the definition of reciprocal inhibition has been changed in the situation where multiple nodes have equally large outputs. In this case, which represents a state with multiple equally optimal actions, j_max(k) will equal the node with the smallest index j. Therefore, the controller will perform an action and will learn on every time step.
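The tie-breaking rule can be stated compactly in code. The function name and the sample outputs below are hypothetical; the strict comparison is what makes the smallest index win among equal outputs.

```python
def j_max(outputs):
    """Index of the largest output; the smallest index wins among equal outputs."""
    best = 0
    for j in range(1, len(outputs)):
        if outputs[j] > outputs[best]:   # strict inequality keeps the earlier index on ties
            best = j
    return best


tie_between_last_two = j_max([0.3, 0.7, 0.7])
all_equal = j_max([0.0, 0.0, 0.0])
```

Because an index is always returned, some node is active on every time step, which is the property the paragraph above relies on.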

The learning equation for a node resembles that of a reinforcement center. However, the absolute value of the connection weight at the time of the state change, which was removed in the modified ACP, has been restored into the learning equation [13]. This term, which was originally introduced for biological reasons, is not essential in the network and serves as a learning rate parameter. The discount factor γ describes how an assessment of return in the future is less significant than an assessment of return at the present. As before, only weights associated with a state-action pair being active in the previous τ time steps are eligible for change.


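\[
\Delta W^{\pm}_{ij}(k) = \left[ \gamma\, y_{j_{\max}(k)}(k) - y_{j_{\max}(k-1)}(k-1) + r(k) \right] \sum_{a=1}^{\tau} c_a \left| W^{\pm}_{ij}(k-a) \right| f_s\!\left( \Delta x_{ij}(k-a) \right) \tag{3.28}
\]

\[
f_s\!\left( \Delta x_{ij}(k-a) \right) =
\begin{cases}
1 & \text{if } j = j_{\max}(k-a) \text{ and } x_i(k-a) = 1 \\
0 & \text{otherwise}
\end{cases} \tag{3.29}
\]

\[
W^{+}_{ij}(k) = f_{w+}\!\left[ W^{+}_{ij}(k-1) + \Delta W^{+}_{ij}(k) \right] \tag{3.30a}
\]

\[
W^{-}_{ij}(k) = f_{w-}\!\left[ W^{-}_{ij}(k-1) + \Delta W^{-}_{ij}(k) \right] \tag{3.30b}
\]

3.5 Implementation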

The modified two-layer ACP algorithm and the single layer ACP algorithm were implemented in NetSim and evaluated as regulators of the AEO plant; fundamental limitations prevented a similar evaluation of the original ACP algorithm. The experiments discussed in this section and in §3.6 were not intended to represent an exhaustive analysis of the ACP methods. For several reasons, investigations focused more heavily on the Q learning technique, to be introduced in §4. First, the ACP algorithms can be directly related to the Q learning algorithm. Second, the relative functional simplicity of Q learning, which also possesses fewer free parameters, facilitated the analysis of general properties of direct learning techniques applied to optimal control problems.

This section details the implementation of the ACP reinforcement learning algorithms. The description of peripheral features that are common to both the ACP and Q learning environments will not be repeated in §4.5.
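As a point of reference for the details that follow, one time step of the single layer algorithm of §3.4 might be sketched as below. This is an illustration under several assumptions rather than the NetSim implementation: the class and method names are invented, the learning rates c are hypothetical, the state is assumed to be presented as a single active bin index, and the current weight magnitude is used in place of the magnitude at the time of the state change.

```python
class SingleLayerACP:
    """Illustrative single layer ACP with one-hot (bin index) state input."""

    def __init__(self, n_states, n_actions, gamma=0.95, a_min=0.1, c=(0.0, 0.5, 0.25)):
        self.gamma, self.a_min, self.c = gamma, a_min, c
        # Excitatory weights start near 1.0 and inhibitory weights near -a.
        self.w_plus = [[1.0] * n_actions for _ in range(n_states)]
        self.w_minus = [[-a_min] * n_actions for _ in range(n_states)]
        self.history = []     # (state, action) pairs, newest last
        self.y_prev = None    # output of the node that was active on the previous step

    def outputs(self, s):
        """Node outputs for bin s: the sum of each weight pair, as in (3.27)."""
        return [p + m for p, m in zip(self.w_plus[s], self.w_minus[s])]

    def step(self, s, r):
        """Select an action for bin s and learn from reinforcement r."""
        y = self.outputs(s)
        j = max(range(len(y)), key=lambda idx: y[idx])   # smallest index on ties
        if self.y_prev is not None:
            delta = self.gamma * y[j] - self.y_prev + r
            tau = len(self.c) - 1
            recent = reversed(self.history[-tau:])
            for a, (si, ji) in enumerate(recent, start=1):
                dp = delta * self.c[a] * abs(self.w_plus[si][ji])
                dm = delta * self.c[a] * abs(self.w_minus[si][ji])
                # Bound the weights so their signs are preserved, as in (3.11).
                self.w_plus[si][ji] = max(self.a_min, self.w_plus[si][ji] + dp)
                self.w_minus[si][ji] = min(-self.a_min, self.w_minus[si][ji] + dm)
        self.history.append((s, j))
        self.y_prev = y[j]
        return j


acp = SingleLayerACP(n_states=2, n_actions=2)
first = acp.step(0, 0.0)     # no learning yet; ties resolve to action 0
second = acp.step(1, -1.0)   # negative reinforcement lowers the estimate for (bin 0, action 0)
third = acp.step(0, 0.0)     # on revisiting bin 0 the other action now has the larger output
```

The step method is meant to be called once per bin transition, consistent with the non-uniform time steps of the modified ACP.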

The requirement that the learning algorithm's input space consist of a finite set of disjoint states necessitated a BOXES [8] type algorithm to quantize the continuous dynamic state information that was generated by the simulation of the AEO equations of motion. As a result, the input space was divided into 200 discrete states. The 20 angular boundaries occurred at 18° intervals, starting at 0°; the 9 boundaries in magnitude occurred at 1.15, 1.0, 0.85, 0.7, 0.55, 0.4, 0.3, 0.2, and 0.1 nondimensional units; the outer annulus of bins did not have a maximum limit on the magnitude of the state vectors that it contained.

The terms bins and discrete states are interpreted synonymously. The aeroelastic oscillator has two state variables: position and velocity. The measurement of these variables in the space of continuous real numbers will be referred to as the dynamic state or continuous state.
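The quantization into 20 angular sectors and 10 annuli can be sketched as follows. The boundary values are those given above; the bin numbering, the function name, and the convention that the angle is measured from the positive position axis using atan2(velocity, position) are assumptions made for illustration.

```python
import math
from bisect import bisect_right

# Magnitude boundaries from the text, in ascending order (nondimensional units).
MAGNITUDE_BOUNDARIES = [0.1, 0.2, 0.3, 0.4, 0.55, 0.7, 0.85, 1.0, 1.15]
N_SECTORS = 20                      # 18 degree sectors starting at 0 degrees


def bin_index(position, velocity):
    """Map a continuous state to one of 200 bins (annulus * 20 + sector)."""
    magnitude = math.hypot(position, velocity)
    angle = math.degrees(math.atan2(velocity, position)) % 360.0
    ring = bisect_right(MAGNITUDE_BOUNDARIES, magnitude)   # 0..9; outer annulus is unbounded
    sector = int(angle // 18.0) % N_SECTORS
    return ring * N_SECTORS + sector


inner = bin_index(0.05, 0.01)      # innermost annulus, first sector
outer = bin_index(2.0, 0.1)        # beyond 1.15: outer annulus, first sector
middle = bin_index(0.01, -0.5)     # magnitude near 0.5, angle near 271 degrees
```

Nine magnitude boundaries yield ten annuli, and 10 × 20 gives the 200 discrete states stated above.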

The artificial definition of time steps as the non-uniform intervals between entering and leaving bins eliminates the significance of τ as the longest interstimulus interval over which delay conditioning is effective.

The ACP learning control system was limited to a finite number of discrete outputs: +0.5 and -0.5 nondimensional force units.

The learning algorithm operated through a hierarchical process of trials and experiments. Each experiment consisted of numerous trials and began with the initialization of weights and counters. Each trial began with the random initialization of the state variables and ran for a specified length of time. In the two-layer architecture, the motor center and reinforcement center weights were randomly initialized using uniform distributions between {-1.0, -a} and {a, 1.0}. In the single layer architecture, all excitatory weights were initialized within a small uniform random deviation of 1.0, and all inhibitory weights were initialized within a small uniform random deviation of -a. The impetus for this scheme was to originate weights sufficiently large such that learning with non-positive reinforcement (i.e. zero reward and non-negative cost) would only decrease the weights.

The learning system operates in discrete time. At every time step, the dynamic state transitions to a new value either in the same bin or in a new bin and the system evaluates the current assessment of either cost and reward or reinforcement. For each discrete time step that the state remains in a bin, the reinforcement accumulates as the sum of the current reinforcement and the accretion of previous reinforcements discounted by γ. The arrows in Figure 3.6 with arrowheads lying in Bin 1 represent the discrete time intervals that contribute reinforcement to learning in Bin 1. Learning for Bin 1 occurs when the final contribution r_5 is received, where the total reinforcement equals the sum of r_5 and γ times the total reinforcement accumulated up to the preceding discrete time.

Initial states (position and velocity) were uniformly distributed between -1.2 and +1.2 nondimensional units.

Table 3.1. ACP parameters.

Name                              Symbol   Value
Discount Factor                   γ        0.95
Threshold                         θ        0.0
Minimum Bound on |W|              a        0.1
Maximum Motor Center Output       β        1.0
Maximum Interstimulus Interval    τ        5

Figure 3.6. A state transition and reinforcement accumulation cartoon.


For the two-layer ACP, the reward presented to the positive reinforcement center was zero, while the cost presented to the negative reinforcement center was a quadratic valuation of the state error. In the single layer learning architecture, the quadratic expression for the reinforcement signal r, for a single discrete time interval, was the negative of the product of the square of the magnitude of the state vector, at the final time for that interval, and the length of that time interval. The quadratic expression for cost in the two-layer ACP was -r. The magnitude of the control expenditure was omitted from the reinforcement function because the contribution was constant for the two-action control laws. With t_1 and t_2 denoting the initial and final times of the interval and x the state vector,

\[
r = -\left\| \mathbf{x}(t_2) \right\|^2 \left( t_2 - t_1 \right) \tag{3.31}
\]
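The reinforcement signal and its accumulation while the state remains in a bin can be sketched together. The function names and all numerical values are hypothetical; the cost used by the two-layer ACP is obtained by negating the reinforcement.

```python
def interval_reinforcement(position, velocity, t_start, t_end):
    """Negative squared state magnitude at the final time, times the interval length."""
    return -(position ** 2 + velocity ** 2) * (t_end - t_start)


def accumulate(increments, gamma):
    """Running total: each new reinforcement plus gamma times the previous total."""
    total = 0.0
    for r in increments:
        total = r + gamma * total
    return total


r_example = interval_reinforcement(3.0, 4.0, 0.0, 0.5)   # single discrete interval
cost_example = -r_example                                # cost for the two-layer ACP
total_example = accumulate([-1.0, -1.0, -1.0], 0.5)      # three intervals inside one bin
```

The accumulated total at the moment the state leaves the bin is the reinforcement that enters the learning equation for that bin.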


3.6  Results 


Figure 3.7 illustrates a typical segment of a trial prior to learning, in which an ACP learning system regulated the AEO plant; the state trajectory wandered clockwise around the phase plane, suggesting the existence of two stable limit cycles.

The modified two-layer ACP system failed to learn a control law which drove the state from an arbitrary initial condition to the origin. Instead, the learned control law produced trajectories with unacceptable behavior near the origin (Figure 3.8). The terminal condition for the AEO state controlled by an optimal regulator with a finite number of discrete control levels is a limit cycle. However, the two-layer ACP failed to converge to the optimal control policy. Although the absence of a set of learning parameters for which the algorithm would converge to an optimal solution cannot be easily demonstrated, §3.3 clearly identifies several undesirable properties of the algorithm.

Figure 3.7. A characteristic AEO state trajectory achieved by a reinforcement learning algorithm prior to learning.

The single layer architecture of the ACP learned the optimal control law, which successfully regulated the AEO state variables near zero from any initial condition within the region of training, {-1.2, 1.2}. The performance of the control policy was limited by the coarseness of the bins and the proximity of bin boundaries to features of the nonlinear dynamics. The restricted choice of control actions also bounds the achievable performance, contributing to the rough trajectory in Figure 3.9.


Figure 3.8. The AEO state trajectory achieved by the modified
two-layer ACP after learning. (Axes: Position, Velocity.)

Figure 3.9. The AEO state trajectory achieved by the single layer
ACP after learning. (Axes: Position, Velocity.)


The notation and concepts presented in §4.1 through §4.4 follow directly from
Watkins' thesis [16] and [17]. §4.5 and §4.6 present results of applying Q learning
to the AEO. §4.7 explores a continuous version of Q learning.

4.1  Terminology 


4.1.1  Total  Discounted  Future  Return 

A discrete-time system that performs an action $a_k$ in a state $x_k$, at time $k$,
receives a performance evaluation $r_k$ associated with the transition to the state
$x_{k+1}$ at time $k+1$; the evaluation $r_k$ is referred to as the return at time $k$.^1
The total future return after time $k$, which equals the sum of the returns assessed
between time $k$ and the completion of the problem, may be unbounded for an infinite
horizon problem. However, the return received in the distant future is frequently
less important, at the present time, than contemporary evaluations. Therefore,
the total discounted future return, defined in (4.1) and guaranteed to be finite, is
proposed.

^1 Watkins defines return as the total discounted future reward; this paper equates the
terms return and reward.

$$\sum_{n=0}^{\infty} \gamma^n r_{k+n} = r_k + \gamma r_{k+1} + \gamma^2 r_{k+2} + \cdots + \gamma^n r_{k+n} + \cdots \tag{4.1}$$

The discount factor, $0 < \gamma < 1$, determines the present value of future returns.
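
As a small numerical illustration of (4.1), the following sketch sums a truncated sequence of returns and compares it with the geometric bound that makes the infinite sum finite. The function names and the sample data are assumptions introduced here for illustration; they do not appear in the original work.

```python
def discounted_return(returns, gamma):
    """Truncated form of (4.1): sum of gamma**n * r_{k+n} over the given returns."""
    total = 0.0
    for n, r in enumerate(returns):
        total += (gamma ** n) * r
    return total


def return_bound(r_max, gamma):
    """Bound on the magnitude of (4.1) when |r| <= r_max and 0 < gamma < 1."""
    return r_max / (1.0 - gamma)


# Hypothetical data: a constant return of -1 per step with gamma = 0.5.
partial = discounted_return([-1.0] * 20, 0.5)
```

However many terms are included, the magnitude of the partial sum stays below `return_bound(1.0, 0.5)`, which is the sense in which the discounted return is guaranteed to be finite.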

4.1.2  The  Markov  Decision  Process 

A non-absorbing, finite-state, finite-action, discrete time Markov decision
process is described by a bounded set of states $S$, a countable set of actions for each
state $A(x)$ where $x \in S$, a transition function $T$, and an evaluation mechanism $R$.
At time $k$, the state is designated by a random variable $X_k$ and the true value $x_k$.
The transition function defines $X_{k+1} = T(x_k, a_k)$ where $a_k \in A(x_k)$; the new state
must not equal the previous state with probability equal to unity. At time $k$, the
return is denoted by a random variable $R_k = R(x_k, a_k)$ and the actual evaluation
$r_k$. The expectation of the return is written $\bar{R}_k$. The Markov property implies that
the transition and evaluation functions depend on the current state and current
action, and do not depend on previous states, actions, or evaluations.

4.1.3  Value  Function 

In a Markov decision process, the expectation of the total discounted future
return depends only on the current state and the stationary policy. A convenient
notation for the probability that performing action $a$ in state $x$ will leave the
system in state $y$ is $P_{xy}(a)$. The random variable representing the future state
$X_{k+n}$ achieved by starting in the state $x_k$ at time $k$ and following policy $f$ for $n$
time steps is written as $X(x_k, f, n)$.

$$X(x_k, f, 0) = x_k \tag{4.2a}$$

$$X(x_k, f, 1) = X_{k+1} = T(x_k, f(x_k)) \tag{4.2b}$$

If policy $f$ is followed for $n$ time steps from state $x_k$ at time $k$, the return realized
for applying $f(x_{k+n})$ in state $x_{k+n}$ is expressed as $R(x_k, f, n)$.

$$R(x_k, f, 0) = R(x_k, f(x_k)) = R_k \tag{4.3a}$$

$$R(x_k, f, n) = R(X_{k+n}, f(X_{k+n})) = R_{k+n} \tag{4.3b}$$

The expected total discounted future return subsequent to the state $x$, applying
the invariant policy $f$, is the value function $V_f(x)$, where an overbar denotes an
expectation.

$$V_f(x) = \bar{R}(x, f, 0) + \gamma \bar{R}(x, f, 1) + \cdots + \gamma^n \bar{R}(x, f, n) + \cdots \tag{4.4a}$$

$$= \bar{R}(x, f, 0) + \gamma\, \overline{V_f(X(x, f, 1))} \tag{4.4b}$$

$$= \bar{R}(x, f, 0) + \gamma \sum_{y} P_{xy}(f(x))\, V_f(y) \tag{4.4c}$$

In (4.4c), $y$ ranges over the subset of $S$ that is reachable from $x$ in a single time step.
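
The recursion (4.4c) can be sketched as a repeated sweep that evaluates a fixed policy. The two-state process below, its action names "a" and "b", its numbers, and the function name `evaluate_policy` are all hypothetical and chosen only so that the result can be checked by hand; this is a minimal sketch, not the procedure used in the experiments.

```python
# Hypothetical two-state, two-action process (not from the original work).
# P[x][a] maps each reachable next state y to the probability P_xy(a).
P = {
    0: {"a": {1: 1.0}, "b": {0: 0.5, 1: 0.5}},
    1: {"a": {0: 1.0}, "b": {0: 0.5, 1: 0.5}},
}
# R[x][a] is the expected return for performing action a in state x.
R = {
    0: {"a": -1.0, "b": -2.0},
    1: {"a": 0.0, "b": -1.0},
}


def evaluate_policy(P, R, policy, gamma, sweeps=200):
    """Approximate V_f by applying (4.4c) repeatedly to an all-zero estimate."""
    V = {x: 0.0 for x in P}
    for _ in range(sweeps):
        V = {
            x: R[x][policy[x]]
            + gamma * sum(p * V[y] for y, p in P[x][policy[x]].items())
            for x in P
        }
    return V


V_f = evaluate_policy(P, R, {0: "a", 1: "a"}, gamma=0.5)
```

For this policy the fixed point can be solved directly: $V_f(0) = -1 + 0.5\,V_f(1)$ and $V_f(1) = 0.5\,V_f(0)$ give $V_f(0) = -4/3$ and $V_f(1) = -2/3$.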


4.1.4  Action Value

The action-value $Q_f(x, a)$ is the expectation of the total discounted future
return for starting in state $x$, performing action $a$, and subsequently following
policy $f$. Watkins refers to action-values as Q values. A Q value represents the
same information as the sum of an excitatory weight and an inhibitory weight in
Drive-Reinforcement learning, which is used in the single layer ACP.

$$Q_f(x, a) = \bar{R}(x, a) + \gamma \sum_{y} P_{xy}(a)\, V_f(y) \tag{4.5}$$

The expression for an action-value (4.5) indicates that the value function for policy
$f$ must be completely known prior to computing the action-values.

Similarly, $Q_f(x, g)$ is the expected total discounted future return for starting
in $x$, performing action $g(x)$ according to policy $g$, and subsequently following
policy $f$.
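
Equation (4.5) can be sketched as a one-step lookahead through a known value function. The two-state process, the supplied value function, and the name `action_values` are hypothetical and serve only to make the dependence on $V_f$ concrete.

```python
# Hypothetical two-state, two-action process (not from the original work).
P = {
    0: {"a": {1: 1.0}, "b": {0: 0.5, 1: 0.5}},
    1: {"a": {0: 1.0}, "b": {0: 0.5, 1: 0.5}},
}
R = {
    0: {"a": -1.0, "b": -2.0},
    1: {"a": 0.0, "b": -1.0},
}


def action_values(P, R, V, gamma):
    """Apply (4.5): Q_f(x, a) = R(x, a) + gamma * sum_y P_xy(a) * V_f(y)."""
    return {
        x: {
            a: R[x][a] + gamma * sum(p * V[y] for y, p in P[x][a].items())
            for a in P[x]
        }
        for x in P
    }


# Value function of the policy that always selects "a" (solved by hand).
V_f = {0: -4.0 / 3.0, 1: -2.0 / 3.0}
Q_f = action_values(P, R, V_f, gamma=0.5)
```

Here `Q_f[0]["a"]` reproduces $V_f(0)$, as expected for the action the policy itself selects, while `Q_f[0]["b"]` is lower.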

4.2  Policy  Iteration 

The Policy Improvement Theorem [16] states that a policy $g$ is uniformly
better than or equivalent to a policy $f$ if and only if,

$$Q_f(x, g) \geq V_f(x) \quad \text{for all } x \in S. \tag{4.6}$$

This theorem and the definition of action-values imply that for a policy $g$ which
satisfies (4.6), $V_g(x) \geq V_f(x)$ for all $x \in S$. The Policy Improvement Algorithm
selects an improved policy $g$ according to the rule: $g(x) = a \in A(x)$ such that
$a$ is the argument that maximizes $Q_f(x, a)$. However, to determine the
action-values $Q_f(x, a)$ for $f$, the entire value function $V_f(x)$ must first be calculated. In
the context of a finite-state, finite-action Markov process, policy improvement will
terminate after applying the algorithm a finite number of times; the policy $g$ will
converge to an optimal policy.

The Optimality Theorem [16] describes a policy $f^*$ which cannot be improved
using the policy improvement algorithm. The associated value function $V_{f^*}(x)$ and
action-values $Q_{f^*}(x, a)$ satisfy (4.7) and (4.8) for all $x \in S$.

$$V_{f^*}(x) = \max_{a \in A(x)} Q_{f^*}(x, a) \tag{4.7}$$

$$f^*(x) = a \ \text{such that} \ Q_{f^*}(x, a) = V_{f^*}(x) \tag{4.8}$$

The optimal value function and action-values are unique; the optimal policy is
unique except in states for which several actions yield equal and maximizing
action-values.
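
The alternation between evaluating a policy and improving it can be sketched as follows. The two-state process, the starting policy, the tolerance, and the function names are hypothetical; the loop stops when no action-value exceeds the value of the current policy, which is the condition expressed by (4.7) and (4.8).

```python
GAMMA = 0.5
# Hypothetical two-state, two-action process (not from the original work).
P = {
    0: {"a": {1: 1.0}, "b": {0: 0.5, 1: 0.5}},
    1: {"a": {0: 1.0}, "b": {0: 0.5, 1: 0.5}},
}
R = {
    0: {"a": -1.0, "b": -2.0},
    1: {"a": 0.0, "b": -1.0},
}


def q_value(V, x, a):
    """Action-value of (x, a) given a value function V, as in (4.5)."""
    return R[x][a] + GAMMA * sum(p * V[y] for y, p in P[x][a].items())


def evaluate(policy, sweeps=200):
    """Approximate the value function of a fixed policy by repeated sweeps."""
    V = {x: 0.0 for x in P}
    for _ in range(sweeps):
        V = {x: q_value(V, x, policy[x]) for x in P}
    return V


def policy_iteration(policy):
    """Improve the policy until no state has a strictly better action."""
    policy = dict(policy)
    while True:
        V = evaluate(policy)
        changed = False
        for x in P:
            best = max(P[x], key=lambda a: q_value(V, x, a))
            # Small tolerance so that numerical ties do not cause cycling.
            if q_value(V, x, best) > q_value(V, x, policy[x]) + 1e-9:
                policy[x] = best
                changed = True
        if not changed:
            return policy, V


final_policy, final_V = policy_iteration({0: "b", 1: "b"})
```

Starting from the policy that always selects "b", one improvement step switches both states to "a", and a second evaluation confirms that no further improvement exists.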


4.3  Value  Iteration 


The value iteration [16] procedure calculates an optimal policy by choosing for
each state the action which effects a transition to the new state that possesses the
maximum evaluation; the optimal value function determines the evaluation of each
state that succeeds the current state. The expected total discounted future return
for a finite horizon process which consists of $n$ transitions and a subsequent final
return, to evaluate the terminal state, is represented as $V^n$. The value function,
which corresponds to the infinite horizon problem, is approximated by repeatedly
applying rule (4.9) to an initial estimate $V^0$.

$$V^n(x) = \max_{a \in A(x)} \left[ \bar{R}(x, a) + \gamma \sum_{y} P_{xy}(a)\, V^{n-1}(y) \right] \tag{4.9}$$

Value iteration guarantees that the error in (4.10) approaches zero uniformly over
all states. Therefore, $V^n$ converges to the optimal value function and the optimal
policy can be derived directly from $V^n$.

$$\lim_{n \to \infty} \left| V^n - V_{f^*} \right| = 0 \tag{4.10}$$

Although this procedure is computationally simplest if all states are systematically
updated so that $V^n$ is completely determined from $V^{n-1}$ before $V^{n+1}$ is calculated
for any state, Watkins has demonstrated that the value iteration method will still
converge if the values of individual states are updated in an arbitrary order, provided
that all states are updated sufficiently frequently.
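
Rule (4.9) and the remark about update order can be sketched together. With `in_place=False` every state is backed up from the previous estimate, and with `in_place=True` each backup immediately uses the newest values of the other states. The two-state process and all names are hypothetical, and the in-place sweep is only one simple instance of an arbitrary update order.

```python
# Hypothetical two-state, two-action process (not from the original work).
P = {
    0: {"a": {1: 1.0}, "b": {0: 0.5, 1: 0.5}},
    1: {"a": {0: 1.0}, "b": {0: 0.5, 1: 0.5}},
}
R = {
    0: {"a": -1.0, "b": -2.0},
    1: {"a": 0.0, "b": -1.0},
}


def backup(V, gamma, x):
    """Right-hand side of (4.9) for a single state x."""
    return max(
        R[x][a] + gamma * sum(p * V[y] for y, p in P[x][a].items())
        for a in P[x]
    )


def value_iteration(gamma, sweeps=100, in_place=False):
    """Repeatedly apply (4.9), either synchronously or state by state."""
    V = {x: 0.0 for x in P}
    for _ in range(sweeps):
        target = V if in_place else dict(V)
        for x in P:
            target[x] = backup(V, gamma, x)
        V = target
    return V


V_sync = value_iteration(0.5)
V_async = value_iteration(0.5, in_place=True)
```

Both variants approach the same value function for this example, $V(0) = -4/3$ and $V(1) = -2/3$.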

4.4  Q  Learning 

Unfortunately, neither the optimal policy nor optimal value function can be
initially known in a control problem. Therefore, the learning process involves
simultaneous, incremental improvements in both the policy function and the value
function. Action-values $Q_k(x_k, a_k)$ for each state-action pair at time $k$ contain
both policy and value information; the policy and value function at time $k$ are
defined in (4.11) and (4.12) in terms of Q values.

$$f_k^Q(x) = a \ \text{such that} \ Q_k(x, a) = V_k^Q(x) \tag{4.11}$$

$$V_k^Q(x) = \max_a \left[ Q_k(x, a) \right] \tag{4.12}$$

The superscript $Q$ denotes the derivation of the policy and the value function from
the set of action-values $Q_k(x_k, a_k)$. Single step Q learning adjusts the action-values
according to (4.13).

$$Q_{k+1}(x_k, a_k) = (1 - \alpha)\, Q_k(x_k, a_k) + \alpha \left( r_k + \gamma V_k^Q(x_{k+1}) \right) \tag{4.13}$$

The positive learning rate constant $\alpha$ is less than unity. Only the action-value
of the state-action pair $(x_k, a_k)$ is altered at time $k$; to guarantee convergence of
the value function to the optimal, each action must be repeatedly performed in
each state. As a form of dynamic programming, Q learning may be described as
incremental Monte-Carlo value iteration.
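
A single application of (4.13) to a table of action-values can be sketched as below, together with the greedy policy of (4.11). The table layout, the function names, and the sample numbers are assumptions made for illustration.

```python
def q_update(Q, x, a, r, x_next, alpha, gamma):
    """One step of (4.13); only the entry for the pair (x, a) is altered."""
    v_next = max(Q[x_next].values())  # V_k^Q(x_{k+1}) from (4.12)
    Q[x][a] = (1.0 - alpha) * Q[x][a] + alpha * (r + gamma * v_next)
    return Q[x][a]


def greedy_action(Q, x):
    """Policy of (4.11): an action whose Q value equals the maximum."""
    return max(Q[x], key=Q[x].get)


# Hypothetical table with two states and two actions per state.
Q = {0: {"a": 0.0, "b": 0.0}, 1: {"a": -1.0, "b": -3.0}}
new_value = q_update(Q, x=0, a="a", r=-2.0, x_next=1, alpha=0.5, gamma=0.5)
```

After this update the entry for (0, "a") has moved halfway toward the target $r_k + \gamma V_k^Q(x_{k+1}) = -2.5$, so the greedy action in state 0 becomes "b".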


4.5  Implementation 

This section formalizes the implementation of the Q learning algorithm as a
regulator for the aeroelastic oscillator plant. The environment external to the Q
learning process was similar to that used for the ACP experiments in §3.5 and §3.6.
However, the quantization of the state space was altered. The boundaries of the
260 bins that covered the state space were defined by magnitudes $M$ and angles
$A$; the outer annulus of bins did not have a maximum magnitude.

$$M = \{0.0,\ 0.05,\ 0.1,\ 0.15,\ 0.2,\ 0.25,\ 0.3,\ 0.35,\ 0.4,\ 0.5,\ 0.6,\ 0.7,\ 0.85,\ 1.0\}$$

$$A = \{0^\circ, 18^\circ, 36^\circ, 54^\circ, 72^\circ, 90^\circ, 108^\circ, 126^\circ, 144^\circ, 162^\circ, 180^\circ, 198^\circ, 216^\circ, 234^\circ, 252^\circ, 270^\circ, 288^\circ, 306^\circ, 324^\circ, 342^\circ\}$$

The bins were labeled with integer numbers from 0 to 259, starting with the bins
in the outer ring, within a ring increasing in index with increasing angle from $0^\circ$,
and continuing to the next inner ring of bins.
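
The labeling scheme can be sketched as a lookup from a state to a bin number. Several details are assumptions rather than facts from the text: the state is taken to be a (position, velocity) pair, the angle is measured counterclockwise from the positive position axis, and, because 260 bins correspond to 13 rings of 20 sectors, the largest listed magnitude (1.0) is treated as a nominal outer radius that is not enforced. The function name `bin_index` is also hypothetical.

```python
import bisect
import math

# Assumed lower edges of the 13 rings; the outermost ring is unbounded above.
RING_EDGES = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.85]
N_SECTORS = 20          # 18-degree sectors starting at 0 degrees
SECTOR_WIDTH = 18.0


def bin_index(position, velocity):
    """Map a state to a label in 0..259, counting from the outer ring inward."""
    magnitude = math.hypot(position, velocity)
    ring = bisect.bisect_right(RING_EDGES, magnitude) - 1  # 0 is the innermost ring
    angle = math.degrees(math.atan2(velocity, position)) % 360.0
    sector = int(angle // SECTOR_WIDTH) % N_SECTORS
    rings_from_outside = (len(RING_EDGES) - 1) - ring
    return rings_from_outside * N_SECTORS + sector
```

Under these assumptions, a state far from the origin on the positive position axis falls in bin 0, and the bins of the innermost ring are numbered 240 through 259.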

For each state-action pair, the Q learning algorithm stores a real number that
represents the Q value. At the start of a new NetSim experiment, all Q values were
initialized to zero.

The two parameters which appear in (4.13) were: $\gamma = 0.35$ and $\alpha = 0.5$. In
this context, $\alpha$ is a learning rate parameter; in the ACP description, $a$ was the
minimum bound on the absolute value of the weights. The return was given in
(3.31) as the negative of the product of the squared magnitude of the state vector
and the length of the time interval.


4.6  Results


The following two experiments, conducted in the NetSim [11] environment,
characterize the performance of the Q learning algorithm. The two experiments
differ in the set of allowable control actions.

Experiment 1: $u_k \in \{0.5, -0.5\}$

Experiment 2: $u_k \in \{0.5, 0.33, 0.167, 0.0, -0.167, -0.33, -0.5\}$

The learned optimal policy for Experiment 1 appears in Figures 4.1a and 4.1b.
The control law applied a +0.5 force whenever the state resided in a bin containing
a + and applied -0.5 whenever the state was in an empty bin. The general form of
this control policy resembles the non-optimal bang-bang law that was derived from
a LQR solution in §2.4.3. Figure 2.9 demonstrated that for the non-optimal
bang-bang control policy, the state trajectory slowly approached the origin along a linear [...]

Figure 4.1a. A Cartesian representation of the two-action optimal
control policy.

Figure 4.1b. A polar representation of the two-action optimal
control policy.

Figure 4.3. Experiment 2: Expected discounted future return (Q
value) for each state-action pair.

[...] 288°. The 15 pairs of bins which violate this pattern lie primarily near the linear
switching curve (§2.4.3).

Figures 4.2 and 4.3 compare the expected discounted future returns for all
state-action pairs in Experiments 1 and 2, respectively. The expected discounted
future return was negative for all state-action pairs because only negative return
(i.e. cost) was assessed. Moreover, the Q values for bins nearer the origin were
greater than the Q values for outlying bins. The fact that a non-optimal action
performed in a single bin does not significantly affect the total cost for a trajectory,
when optimal actions are followed in all other bins (in this problem), explains the
similarity between most Q values associated with different actions and the same
state. Additionally, the Q values varied almost periodically as a function of the
bin number; the largest variance existed for the bins farthest from the origin. All
trajectories tended to approach the origin along the same paths through the second
and fourth quadrants (Figures 4.4, 4.6, 4.8, and 4.10). Therefore, if an initial
condition was such that the limited control authority could not move the state onto
the nearest path toward the origin, then the trajectory circled halfway around the
state space to the next path toward the origin. This characteristic was a property
of the AEO nonlinear dynamics, and accounted for the large differences in Q values
for neighboring bins. In Experiment 1, there existed bins for which the choice of the
initial control action determined whether this circling was necessary. For these bins,
the expected discounted future returns for the two actions differed substantially.

The control law constructed in Experiment 2 was expected to outperform
the control law constructed in Experiment 1 (i.e. for each bin, the maximum Q
value from Figure 4.3 would exceed the maximum Q value from Figure 4.2). For
the data presented, this expectation is true for 60% of the bins. The bins that
violate this prediction are entirely located in the regions of the state space that the
state enters least frequently. Experiment 2, with a greater number of state-action
pairs, required substantially more training than Experiment 1. The fact that for
certain bins, the maximum Q value from Experiment 1 exceeds that for Experiment
2 signals insufficient learning in those bins for Experiment 2.

No explicit search mechanism was employed during learning. Moreover, the
dynamics tended to force all trajectories onto the same paths, so that many bins
were seldom entered. Therefore, to assure that a globally optimal policy was
attained, sufficient trials were required so that the random selection of the initial
state provided sufficient experience about performing every action in every bin.
Over 2000 trials were performed in each experiment to train the learning system.
If the learning rate had been a focus of the research, an explicit search procedure
could have been employed. Additionally, in some experiments, the Q values did not
converge to a steady state. Some of the bins were excessively large such that the
optimal actions (in a continuous sense) associated with extreme points within the
bin were quite different. Therefore, the Q values for such a bin, and subsequently
the optimal policy, would vary as long as training continued.

All Q learning experiments learned a control policy that successfully regulated
the aeroelastic oscillator. The state trajectories and control histories of the AEO,
with initial conditions {-1.0, 0.5} and {1.0, 0.5}, which resulted from the control
laws learned in Experiments 1 and 2, appear in Figures 4.4 through 4.11. The
limitation of the control to discrete levels, and the associated sharp control switching,
resulted in rough state trajectories as well as limit cycles around the origin. The
results illustrate the importance of a smooth control law; a continuous control law
(LQR) was discussed in §2.4.2 and a characteristic state trajectory appeared in
Figure 2.5. The absence of actuator dynamics and of a penalty on the magnitude of the
control allows the application of larger values of control to maximize reinforcement.
Therefore, Experiment 2 seldom selected a smaller or zero control force, even for
bins near the origin. In Experiment 1 the magnitude of the control was constant.


4.7  Continuous Q Learning


The algorithm described in §4.4 operates with both a finite set of states and
discrete control actions. The optimal control $a^*$ maximizes $Q(x, a)$ for the current
state $x$. To identify the optimal control for a specific state, therefore, requires the
comparison of $Q(x, a)$ for each discrete action $a \in A(x)$.^2 However, quantization
of the input and output spaces is seldom practical or acceptable.

Figure 4.12. A continuous Q function for an arbitrary state x. (Horizontal axis: Control.)

A potential new algorithm, related to the Q learning process of §4.4, might
select, for each discrete state, the optimal control action from a bounded continuum
and employ a continuous Q function that maps the control levels into evaluations
of the expected discounted future return (Figure 4.12). However, to identify the
optimal control for a state requires the maximization of a potentially multi-modal
bounded function; this extremization procedure is problematic relative to the
maximization of discrete Q values. The maximization of a multi-modal function at each
stage in discrete time is itself a complicated optimization problem and, although not
intractable, makes any continuous Q learning procedure impractical for real-time,
on-line applications. This Q learning algorithm directly extends to incorporate
several control variables; the optimal controls for a state are the arguments that
maximize the multidimensional Q function.

^2 A finite number of Q values exist and, therefore, the maximum Q value is easily
obtained.

The Q learning concept may be further generalized to employ continuous
inputs and continuous outputs. The algorithm maps expectations of discounted future
returns as a smooth function of the state and control variables. The current state
will define a hyperplane through this Q function that resembles Figure 4.12 for a
single control variable. Again, a maximization of a potentially multi-modal
function is required to compute each control. Although the continuous nature of the
state inputs does not operationally affect the identification of an optimal control,
the mapping and learning mechanisms must incorporate the local generalization of
information with respect to state, a phenomenon which does not occur for discrete
state bins. A continuous Q function could be represented by any function
approximation scheme, such as the spatially localized connectionist network introduced in
§6.
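
To make the cost of this maximization concrete, the following sketch searches a bounded control interval with a coarse grid and then refines the grid around the best sample. This is one simple heuristic chosen here for illustration, not a method proposed in the text, and it can miss a narrow global peak that falls between coarse samples. The function `example_q` is a made-up multi-modal curve standing in for a learned Q function at a fixed state.

```python
import math


def maximize_q(q, lo, hi, points=41, refinements=3):
    """Coarse grid search over [lo, hi] followed by local grid refinement."""
    lower, upper = lo, hi
    best_u = lo
    for _ in range(refinements + 1):
        step = (upper - lower) / (points - 1)
        grid = [lower + i * step for i in range(points)]
        best_u = max(grid, key=q)
        # Shrink the search interval to the neighbors of the best sample.
        lower = max(lo, best_u - step)
        upper = min(hi, best_u + step)
    return best_u


def example_q(u):
    """Hypothetical multi-modal Q curve over the control u (illustration only)."""
    return math.cos(20.0 * u) - 2.0 * (u - 0.25) ** 2


u_star = maximize_q(example_q, -0.5, 0.5)
```

Each control decision in this sketch requires over one hundred evaluations of the Q function, compared with two or seven table lookups in the discrete experiments, which illustrates why the text regards the continuous variant as costly for on-line use.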

Baird [42] addressed the difficulty of determining the global maximum of a
multi-modal function. Millington [41] proposed a direct learning control method
that used a spatially localized connectionist / Analog Learning Element. The
learning system defined, as a distributed function of state, a continuous probability
density function for control selection.


Temporal difference (TD) methods comprise a class of incremental learning
procedures that predict future system behavior as a function of current
observations. The earliest temporal difference algorithm appeared in Samuel's
checker-playing program [18].^1 Manifestations of the TD algorithm also exist in Holland's
bucket brigade [13], Sutton's Adaptive Heuristic Critic [o,29], and Klopf's
Drive-Reinforcement learning [12]. This chapter summarizes Sutton's unification of these
algorithms into a general temporal difference theory [15] and then analyzes the
similarities and distinctions between the Adaptive Heuristic Critic, Drive-Reinforcement
learning, and Q learning.

^1 The phrase temporal difference was proposed by Sutton in 1988 [15].

5.1  TD(λ) Learning Procedures


Most problems to which learning methods are applicable can be formulated
as a prediction problem, where future system behavior must be estimated from
transient sequences of available sensor outputs. Conventional supervised learning
prediction methods associate an observation and a final outcome pair; after
training, the learning system will predict the final outcome that corresponds to an input.
In contrast, temporal difference methods examine temporally successive predictions
of the final result to derive a similar mapping from the observations to the final
outcome. Sutton demonstrates that TD methods possess two benefits relative to
supervised learning prediction methods [15]. Supervised learning techniques must
wait until the final outcome has been observed before performing learning
calculations and, therefore, to correlate each observation with the final outcome requires
storage of the sequence of observations that preceded the final result. In contrast,
the TD approach avoids this storage requirement, incrementally learning as each
new prediction and observation are made. This fact, and the associated temporal
distribution of required calculations, make the TD algorithm amenable to running
on-line with the physical plant. Through more efficient use of experience, TD
algorithms converge more rapidly and to more accurate predictions. Although any
learning method should converge to an equivalent evaluation with infinite
experience, TD methods are guaranteed to perform better than supervised learning
techniques after limited experience with a Markov decision process.

Temporal difference and conventional supervised learning are indistinguishable
for single step prediction problems where the accuracy of a prediction is revealed
immediately. In a multi-step prediction problem, partial information pertinent to
the precision of a prediction is incrementally disclosed through temporally
successive observations. This second situation is more prevalent in optimal control
problems. Multi-stage problems consist of several temporally sequential
observations $\{x_1, x_2, \ldots, x_m\}$ followed by a final result $z$. At each discrete time $t$, the
learning system generates a prediction $P_t$ of the final output, typically dependent
on the current values of a weight set $w$. The learning mechanism is expressed as a
rule for adjusting the weights.

Typically, supervised learning techniques employ a generalization of the
Widrow-Hoff rule [21] to derive weight updates.^2

$$\Delta w_t = \alpha (z - P_t) \nabla_w P_t \tag{5.1}$$

In contrast to (5.1), TD methods are sensitive to changes in successive predictions
rather than the error between a prediction and the final outcome. Sutton has
demonstrated that for a multi-step prediction problem, a TD(1) algorithm produces
the same total weight changes for any observation-outcome sequence as the
Widrow-Hoff procedure. The TD(1) algorithm (5.2) alters prior predictions to an equal
degree.

$$\Delta w_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k \tag{5.2}$$

The temporal difference method generalizes from TD(1) to an algorithm that adjusts
prior predictions in proportion to a factor that equals unity for the current time and
decreases exponentially with increasing elapsed time. This algorithm is referred to
as TD(λ) and (5.3) defines the learning process, where $P_{m+1}$ is identically $z$.

$$\Delta w_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k \tag{5.3}$$

$$0 \leq \lambda \leq 1 \tag{5.4}$$

^2 The Widrow-Hoff rule, also known as the delta rule, requires that $P_t$ be a linear
function of $w$ and $x_t$ so that $\nabla_w P_t = x_t$.

An advantage of this exponential weighting is the resulting simplicity of determining
future values of the summation term in (5.3).

$$S(t+1) = \sum_{k=1}^{t+1} \lambda^{t+1-k} \nabla_w P_k = \nabla_w P_{t+1} + \sum_{k=1}^{t} \lambda^{t+1-k} \nabla_w P_k$$

$$= \nabla_w P_{t+1} + \lambda \sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k = \nabla_w P_{t+1} + \lambda S(t) \tag{5.5}$$

In the limiting case where $\lambda = 0$, the learning process determines the weight
increment entirely by the resulting effect on the most recent prediction. This TD(0)
algorithm (5.6) resembles (5.1) if the final outcome $z$ is replaced by the subsequent
prediction.

$$\Delta w_t = \alpha (P_{t+1} - P_t) \nabla_w P_t \tag{5.6}$$

5.2  An Extension of TD(λ)


The TD family of learning procedures directly generalizes to accomplish the
prediction of a discounted, cumulative result, such as the expected discounted future
cost associated with an infinite horizon optimal control problem. In conformance
with Sutton's notation, $c_t$ denotes the external evaluation of the cost incurred
during the time interval from $t - 1$ to $t$. The prediction $P_t$, which is the output of
the TD learning system, estimates the expected discounted future cost.

$$P_t = \sum_{n=0}^{\infty} \gamma^n c_{t+n+1} \tag{5.7}$$

The discount parameter $\gamma$ specifies the time horizon with which the prediction
is concerned. The recursive nature of the expression for an accurate prediction
becomes apparent by rewriting (5.7).

$$P_{t-1} = c_t + \sum_{n=1}^{\infty} \gamma^n c_{t+n} = c_t + \gamma \sum_{n=0}^{\infty} \gamma^n c_{t+n+1} = c_t + \gamma P_t \tag{5.8}$$

Therefore, the error in a prediction, $(c_t + \gamma P_t) - P_{t-1}$, serves as the impetus for
learning in (5.9).

$$\Delta w_t = \alpha (c_t + \gamma P_t - P_{t-1}) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k \tag{5.9}$$
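
The recursion (5.8) can be checked numerically on a finite cost sequence, taking costs beyond the end of the sequence to be zero. The function names and the sample costs are assumptions made for illustration; an accurate set of predictions leaves a zero prediction error at every stage.

```python
def predictions_by_recursion(costs, gamma):
    """P_0..P_T from (5.8), with costs[t - 1] holding c_t and P_T = 0."""
    P = [0.0] * (len(costs) + 1)
    for t in range(len(costs), 0, -1):
        P[t - 1] = costs[t - 1] + gamma * P[t]
    return P


def prediction_by_sum(costs, gamma, t):
    """P_t from the truncated sum (5.7): c_{t+1} + gamma * c_{t+2} + ..."""
    return sum((gamma ** n) * c for n, c in enumerate(costs[t:]))


def prediction_error(c_t, p_t, p_prev, gamma):
    """The error (c_t + gamma * P_t) - P_{t-1} that drives learning in (5.9)."""
    return c_t + gamma * p_t - p_prev


# Hypothetical costs c_1, c_2, c_3 with gamma = 0.5.
costs = [1.0, 2.0, 4.0]
P = predictions_by_recursion(costs, 0.5)
```

For these numbers the predictions are `[3.0, 4.0, 4.0, 0.0]`, and they match the direct sums of (5.7) term by term.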

5.3  A  Comparison  of  Reinforcement  Learning  Algorithms 

The modified TD(λ) rule (5.9) is referred to as the Adaptive Heuristic Critic
(AHC) and learns to predict the summation of the discounted future values of the
signal $c_t$. With slightly different learning equations, both Q learning and
Drive-Reinforcement (DR) learning accomplish a similar objective. This section compares
the form and function of these three direct learning algorithms.

The single step Q learning algorithm (5.10) defines a change in a Q value,
which represents a prediction of expected discounted future cost, directly, rather
than through adjustments to a set of weights that define the Q value.

$$Q_{t+1}(x_t, a_t) = (1 - \alpha)\, Q_t(x_t, a_t) + \alpha \left( c_t + \gamma V_t(x_{t+1}) \right) \tag{5.10}$$

Although the form of the learning equation appears different than that of the AHC
or DR learning, the functionality is similar. The improved Q value $Q_{t+1}(x_t, a_t)$
equals a linear combination of the initial Q value $Q_t(x_t, a_t)$ and the sum of the
cost for the current stage $c_t$ and the discounted prediction of the subsequent
discounted future cost $\gamma V_t(x_{t+1})$. A similar linear combination to perform incremental
improvements is achieved in both the AHC and Drive-Reinforcement learning by
calculating weight changes with respect to the current weights.
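
The linear combination in (5.10) can equivalently be written as the old Q value plus a step proportional to a prediction error, which is the form shared with the weight-based rules. The short sketch below checks that equivalence numerically; the function names and sample values are assumptions made for illustration.

```python
def q_linear_combination(q, c, v_next, alpha, gamma):
    """Update (5.10) written as a blend of the old value and the new target."""
    return (1.0 - alpha) * q + alpha * (c + gamma * v_next)


def q_error_form(q, c, v_next, alpha, gamma):
    """The same update written as the old value plus alpha times the error."""
    error = c + gamma * v_next - q
    return q + alpha * error


# Hypothetical values for one state-action pair.
blend = q_linear_combination(q=-1.0, c=-0.2, v_next=-0.8, alpha=0.5, gamma=0.9)
step = q_error_form(q=-1.0, c=-0.2, v_next=-0.8, alpha=0.5, gamma=0.9)
```

The two results coincide because $(1 - \alpha) Q + \alpha T = Q + \alpha (T - Q)$ for any target $T$.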

Both the Drive-Reinforcement (DR) and the Adaptive Heuristic Critic
learning mechanisms calculate weight changes that are proportional to the prediction
error $\Delta P_t$.

$$\Delta P_t = c_t + \gamma P_t - P_{t-1} \tag{5.11}$$

The DR learning rule is rewritten in (5.12) to conform to the current notation.

$$\Delta w_t = \Delta P_t \sum_{k} c_k\, f(\ldots) \tag{5.12}$$

In the DR and AHC algorithms, a non-zero prediction error causes the weights to
be adjusted so that $P_{t-1}$ would have been closer to $c_t + \gamma P_t$. The constant of
proportionality between the weight change and the prediction error differs between
DR learning and the AHC.

The Drive-Reinforcement weight changes are defined by (5.12). The limits
on the summation over previous stages in time and the binary facilitation function
$f$ prescribe modifications to a finite number of previous predictions. An array
of constants, $c_k$, encodes a discount that determines the contribution of previous
actions to the current prediction error. In contrast, the summation term in the AHC
learning equation (5.9) allows all previous predictions to be adjusted in response
to a current prediction error. The extent to which an old prediction is modified
decreases exponentially with the elapsed time since that prediction. In the AHC
algorithm, the sensitivity of the prior prediction to changes in the weights, $\nabla_w P_k$,
scales the weight adjustment.

Similar to the AHC and DR learning, an incremental change in a Q value is proportional to the prediction error.

$$\Delta Q_t(x_t, a_t) = Q_{t+1}(x_t, a_t) - Q_t(x_t, a_t) = \alpha\left(c_t - Q_t(x_t, a_t) + \gamma V_t(x_{t+1})\right) \tag{5.13}$$

The expression for the prediction error in (5.13) appears different from (5.11) and warrants some explanation. $V_t(x_{t+1})$, which represents $\max_a [Q_t(x_{t+1}, a_{t+1})]$, denotes the optimal prediction of discounted future cost and, therefore, is functionally equivalent to $P_t$ in (5.11). Moreover, the entire time basis for Q learning is shifted forward one stage with respect to the AHC or DR learning rules. As a result, $Q_t$ operates similarly to $P_{t-1}$ in (5.11) and the symbol $c_t$ performs the same function in (5.11) and (5.13), although the cost is measured over a different period of time in the Q learning rule than in the AHC or DR learning rules (5.9) and (5.12), respectively.

To summarize the comparison of direct learning algorithms, each of the three temporal difference techniques will learn a value function for the expected discounted future cost. More generally, any direct learning algorithm will maintain and incrementally improve both policy and value function information. Furthermore, although the forms of the learning equations differ slightly, each method attempts to reduce the prediction error $\Delta P_t$.
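The incremental Q value update of (5.13) can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the tabular dictionary storage, the function name `q_update`, and the default values of `alpha` and `gamma` are assumptions made for the example, and unvisited state-action pairs are assumed to start at zero.

```python
def q_update(q, x, a, cost, x_next, actions, alpha=0.5, gamma=0.9):
    """Apply one incremental Q value update in the form of (5.13).

    q maps (state, action) pairs to Q values; missing entries count as 0.
    Returns the prediction error that drove the update.
    """
    # V_t(x_{t+1}): the text defines this as a max over the successor's
    # Q values, so the same convention is followed here.
    v_next = max(q.get((x_next, b), 0.0) for b in actions)
    # Prediction error: c_t - Q_t(x_t, a_t) + gamma * V_t(x_{t+1})
    error = cost - q.get((x, a), 0.0) + gamma * v_next
    # Move the stored value a fraction alpha of the way toward the target.
    q[(x, a)] = q.get((x, a), 0.0) + alpha * error
    return error
```

The returned error plays the same role as $\Delta P_t$ in (5.11): it is zero exactly when the stored value already equals the current cost plus the discounted prediction for the successor state.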

Although the functionality of direct learning algorithms may be similar, the structure will vary. For example, Q learning distinguishes the optimal action by maximizing over the set of Q values. The Associative Control Process determines the optimal action through the biologically motivated reciprocal inhibition procedure. Furthermore, whereas Q values may equal any real number, the outputs of ACP learning centers must be non-negative, acknowledging the inability of neurons to realize negative frequencies of firing.

Each control law derived in this chapter attempts to optimally track a reference trajectory that is generated by a linear, time-invariant reference model (6.1); optimization is performed with respect to quadratic cost functionals over finite time horizons. The notation in this chapter uses subscripts to indicate the stage in discrete time and superscripts to distinguish the plant and reference model.


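$$x^r_{k+1} = \Phi^r x^r_k + \Gamma^r r_k \tag{6.1a}$$

$$y^r_k = C^r x^r_k \tag{6.1b}$$

$$y^r_{k+1} = C^r \Phi^r x^r_k + C^r \Gamma^r r_k \tag{6.1c}$$

$$y^r_{k+2} = C^r \Phi^r \Phi^r x^r_k + C^r \Phi^r \Gamma^r r_k + C^r \Gamma^r r_{k+1} \tag{6.1d}$$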


Although a few subsequent values of the command input after $r_k$ may be characterized at time $k$, the future input sequence will be largely unknown. To apply infinite horizon, linear quadratic (LQ) control techniques to the tracking problem requires a description of the future command inputs. Furthermore, in a multi-objective mission, such as aircraft flight involving several different flight conditions, the existence of future maneuvers should negligibly influence the optimization of performance during the current operation. Finally, optimizations over unnecessarily long time frames may require prohibitively long computations. Therefore, finite horizon LQ control directly addresses relevant control problems.

6.1  Single-Stage Quadratic Optimization


The control objective is to minimize the quadratic cost functional $J_k$ which penalizes the current control expenditure and the output error $e_{k+1}$, given by the difference between the reference output and the system output at time $k+1$. The weighting matrices $R$ and $Q$ are symmetric and positive definite.

$$J_k = \frac{1}{2}\left[e_{k+1}^T Q\, e_{k+1} + u_k^T R\, u_k\right] \tag{6.2}$$

$$e_{k+1} = y^r_{k+1} - y_{k+1} \tag{6.3}$$

A single first-order necessary condition defines the condition for a control $u_k$ to minimize the cost functional [22,23].

$$\frac{\partial J_k}{\partial u_k} = 0 \tag{6.4a}$$

$$\left(\frac{\partial e_{k+1}}{\partial u_k}\right)^T Q\, e_{k+1} + R\, u_k = 0 \tag{6.4b}$$


6.1.1  Linear  Compensation 

Assuming a minimum-phase plant, the linear compensator is the solution to the problem of optimal tracking, with respect to the cost functional (6.2), of the reference system (6.1) with a linear, time-invariant system (6.5). Applied to a nonlinear system, this control law serves as a baseline with which to compare a single-stage, indirect learning control law. The fact that indirect learning control is a model based technique distinguishes the approach from direct learning control algorithms.


$$x_{k+1} = \Phi x_k + \Gamma u_k \tag{6.5a}$$

$$y_k = C x_k \tag{6.5b}$$

$$y_{k+1} = C\Phi x_k + C\Gamma u_k \tag{6.5c}$$


That the partial derivative of $e_{k+1}$ with respect to $u_k$ is independent of $u_k$ implies that (6.4b) is linear in the optimal control. Therefore, (6.4b) may be written as an exact analytic expression for the optimal control [24].


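$$\frac{\partial e_{k+1}}{\partial u_k} = -C\Gamma \tag{6.6}$$

$$u_k = \left[(C\Gamma)^T Q\, C\Gamma + R\right]^{-1} (C\Gamma)^T Q \left[C^r\Phi^r x^r_k + C^r\Gamma^r r_k - C\Phi x_k\right] \tag{6.7}$$

The sufficient condition for this control to be a minimizing solution, that $\partial^2 J_k / \partial u_k^2$ is non-negative, is satisfied for physical systems.

$$\frac{\partial J_k}{\partial u_k} = -(C\Gamma)^T Q\, e_{k+1} + R\, u_k = 0 \tag{6.8}$$

$$\frac{\partial^2 J_k}{\partial u_k^2} = (C\Gamma)^T Q\, (C\Gamma) + R > 0 \tag{6.9}$$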
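The single-stage linear control law can be checked numerically. The sketch below assumes NumPy is available; the function name `linear_single_stage` and its argument names are hypothetical, and the reference output one step ahead is passed in directly rather than being built from the reference model (6.1c).

```python
import numpy as np

def linear_single_stage(Phi, Gam, C, Q, R, x, y_ref_next):
    """Control minimizing (6.2) for the linear plant (6.5), as in (6.7).

    y_ref_next is the reference output at time k+1, e.g. from (6.1c).
    """
    A = C @ Gam                     # sensitivity of y_{k+1} to u_k
    d = y_ref_next - C @ Phi @ x    # output error that would result from u_k = 0
    H = A.T @ Q @ A + R             # second derivative of the cost, (6.9)
    # Solve H u = A^T Q d instead of forming the inverse explicitly.
    return np.linalg.solve(H, A.T @ Q @ d)
```

Substituting the returned control into the left side of (6.8) should give zero to within rounding error, which is a convenient way to validate an implementation before adding the learned terms.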


6.1.2  Learning  Control 

In contrast to the single-stage linear compensator, the single-stage, indirect learning controller is the solution to the problem of optimal tracking of the reference system (6.1) by a nonlinear, time-invariant system (6.10), with respect to the cost functional (6.2). Again, the zero dynamics of the plant must be stable. The expression for the discrete time state propagation (6.10a) includes the a priori linear terms from (6.5) as well as two nonlinear terms: $f_k(x_k, u_k)$ represents the initially unmodeled dynamics that have been learned by the system, and $\Psi_k(x_k)$ represents any state dependent dynamics not captured by either the a priori description or the learning augmentation. The assumption of an absence of time varying disturbances and noise from the real system implies that all dynamics are spatially dependent and will be represented in the model. The system outputs are a known linear combination of the states. The notation explicitly shows the time dependence of $f_k$ and $\Psi_k$, which change as learning progresses; $f_k$ will acquire more of the subtleties in the dynamics and, consequently, $\Psi_k$ will approach zero.

$$x_{k+1} = \Phi x_k + \Gamma u_k + f_k(x_k, u_k) + \Psi_k(x_k) \tag{6.10a}$$

$$y_k = C x_k \tag{6.10b}$$

$$y_{k+1} = C\Phi x_k + C\Gamma u_k + C f_k(x_k, u_k) + C\Psi_k(x_k) \tag{6.10c}$$


In this formulation, the first-order necessary condition (6.4) for a control $u_k$ to be optimal with respect to (6.2) cannot be directly solved for $u_k$ because of the presence of the term $f_k(x_k, u_k)$ which may be nonlinear in $u_k$. The Newton-Raphson iterative technique [25] addresses this nonlinear programming problem by linearizing $f_k(x_k, u_k)$ with respect to $u$ at $u_{k-1}$. $\partial f_k / \partial u$ is the Jacobian matrix of $f_k$ with respect to $u$, evaluated at $(x_k, u_{k-1})$. Using this approximation for $f_k(x_k, u_k)$, $y_{k+1}$ assumes a form linear in $u_k$ and (6.4) may be written as an analytic expression for $u_k$ in terms of known quantities.


$$f_k(x_k, u_k) \approx f_k(x_k, u_{k-1}) + \left.\frac{\partial f_k}{\partial u}\right|_{x_k, u_{k-1}} (u_k - u_{k-1}) \tag{6.11}$$

$$y_{k+1} \approx C\Phi x_k + C\Gamma u_k + C f_k(x_k, u_{k-1}) + C \left.\frac{\partial f_k}{\partial u}\right|_{x_k, u_{k-1}} (u_k - u_{k-1}) + C\Psi_k(x_k) \tag{6.12}$$

$$\frac{\partial e_{k+1}}{\partial u_k} = -C\Gamma - C\frac{\partial f_k}{\partial u} \tag{6.13}$$

The solution (6.14) is the first Newton-Raphson estimate for the optimal controls; a pseudo-inverse may be used in (6.14) if the full matrix inversion does not exist. Subsequent estimates $u_k^i$ may be derived by linearizing (6.10) about $u_k^{i-1}$. However, the estimate obtained in the first iteration is often sufficiently accurate because the initial linearization is about $u_{k-1}$ and the admissible change in control $\Delta u_k = u_k - u_{k-1}$ will be limited by actuator rate limits and a sufficiently small discrete time interval [25,26].

The form of the learning augmented control law closely resembles the linear control law (6.7). In (6.14) $C\Gamma$ is modified by $C\,\partial f_k / \partial u$, which describes the variation in control effectiveness that was unmodeled in the linear system. The final three terms of (6.14) are not present in (6.7) and enter from the nonlinear terms in (6.10).

$$u_k = \left[\left(C\Gamma + C\frac{\partial f_k}{\partial u}\right)^T Q \left(C\Gamma + C\frac{\partial f_k}{\partial u}\right) + R\right]^{-1} \left(C\Gamma + C\frac{\partial f_k}{\partial u}\right)^T Q \left[C^r\Phi^r x^r_k + C^r\Gamma^r r_k - C\Phi x_k - C f_k(x_k, u_{k-1}) + C\frac{\partial f_k}{\partial u} u_{k-1} - C\Psi_k(x_k)\right] \tag{6.14}$$
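The first Newton-Raphson estimate can be sketched as a small change to the linear law: the control effectiveness is augmented by the learned Jacobian and the learned terms shift the predicted output. The code below is an illustrative sketch, assuming NumPy; the function and argument names are hypothetical, and the learned quantities (`f_prev` for $f_k(x_k, u_{k-1})$, `dfdu` for the Jacobian, `psi` for $\Psi_k(x_k)$) are assumed to be supplied by the learning system.

```python
import numpy as np

def learning_single_stage(Phi, Gam, C, Q, R, x, u_prev, y_ref_next,
                          f_prev, dfdu, psi):
    """First Newton-Raphson estimate of the control, in the form of (6.14).

    f_prev : learned dynamics evaluated at (x_k, u_{k-1})
    dfdu   : Jacobian of the learned dynamics w.r.t. u at (x_k, u_{k-1})
    psi    : adaptive estimate of the remaining state dependent dynamics
    """
    A = C @ (Gam + dfdu)            # control effectiveness with learned correction
    # Predicted output error for u_k = 0 under the linearization (6.12).
    d = y_ref_next - C @ (Phi @ x + f_prev - dfdu @ u_prev + psi)
    # np.linalg.pinv could be substituted if the matrix is singular.
    return np.linalg.solve(A.T @ Q @ A + R, A.T @ Q @ d)
```

When the learned dynamics happen to be affine in the control, the linearization (6.11) is exact and this single step already satisfies the necessary condition (6.4); otherwise the function could be called again with the linearization point moved to the new estimate.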

A simple adaptive estimate of the unmodeled dynamics at time $k$ is generated by solving (6.10) at the previous time index and assuming $\Psi_k(x_k) \approx \Psi_{k-1}(x_{k-1})$. This adaptive technique is susceptible to noise and disturbances present in the real system [27].

$$\Psi_k(x_k) \approx x_k - \Phi x_{k-1} - \Gamma u_{k-1} - f_{k-1}(x_{k-1}, u_{k-1}) \tag{6.15}$$

■').  "  'or,  -li  t  ■  tt 

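Parallel arguments may be made to derive a control law that is optimal with respect to a cost functional that also penalizes changes in control. The cost functional given in (6.16) discourages large control rates that may not be achievable, as a result of physical limitations of the actuators. The control law (6.18) resembles (6.14) with the addition of two terms involving $S$, which is symmetric and positive definite.

$$J_k = \frac{1}{2}\left[e_{k+1}^T Q\, e_{k+1} + u_k^T R\, u_k + \Delta u_k^T S\, \Delta u_k\right] \tag{6.16}$$

$$\Delta u_k = u_k - u_{k-1} \tag{6.17}$$

$$u_k = \left[\left(C\Gamma + C\frac{\partial f_k}{\partial u}\right)^T Q \left(C\Gamma + C\frac{\partial f_k}{\partial u}\right) + R + S\right]^{-1} \left[\left(C\Gamma + C\frac{\partial f_k}{\partial u}\right)^T Q \left(C^r\Phi^r x^r_k + C^r\Gamma^r r_k - C\Phi x_k - C f_k(x_k, u_{k-1}) + C\frac{\partial f_k}{\partial u} u_{k-1} - C\Psi_k(x_k)\right) + S u_{k-1}\right] \tag{6.18}$$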
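The effect of the control rate weighting can be seen in a short sketch. As before, this assumes NumPy and uses hypothetical names; the learned quantities are assumed to be provided by the learning system, and `S` is the symmetric, positive definite rate weighting.

```python
import numpy as np

def rate_penalized_control(Phi, Gam, C, Q, R, S, x, u_prev, y_ref_next,
                           f_prev, dfdu, psi):
    """Single-stage control with a penalty on the change in control, as in (6.18)."""
    A = C @ (Gam + dfdu)
    d = y_ref_next - C @ (Phi @ x + f_prev - dfdu @ u_prev + psi)
    # S appears twice: in the matrix to be inverted and as a pull toward u_{k-1}.
    return np.linalg.solve(A.T @ Q @ A + R + S, A.T @ Q @ d + S @ u_prev)
```

As the entries of `S` grow, the returned control approaches the previous control, so the weighting acts as a soft actuator rate limit; with `S` set to zero the expression reduces to the form of (6.14).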


6.2  Two-Step Quadratic Optimization


This section parallels the discussion of §6.1 to derive the control laws that are optimal with respect to a two time step, quadratic cost functional (6.19); a few new issues arise. The expression for the reference output two time steps into the future (6.1d) involves a future value of the command input $r_{k+1}$. This derivation assumes that $r_{k+1} \approx r_k$.

$$J_k = \frac{1}{2}\sum_{i=1}^{2}\left[e_{k+i}^T Q_i\, e_{k+i} + u_{k+i-1}^T R_i\, u_{k+i-1}\right] \tag{6.19}$$

Two necessary conditions are required to define a control which is optimal with respect to (6.19). Each of the weighting matrices in (6.19) is symmetric and positive definite.

$$\frac{\partial J_k}{\partial u_k} = 0 \tag{6.20a}$$

$$\left(\frac{\partial e_{k+1}}{\partial u_k}\right)^T Q_1 e_{k+1} + \left(\frac{\partial e_{k+2}}{\partial u_k}\right)^T Q_2 e_{k+2} + R_1 u_k = 0 \tag{6.20b}$$

$$\frac{\partial J_k}{\partial u_{k+1}} = 0 \tag{6.21a}$$

$$\left(\frac{\partial e_{k+2}}{\partial u_{k+1}}\right)^T Q_2 e_{k+2} + R_2 u_{k+1} = 0 \tag{6.21b}$$

6.2.1  Linear Compensation

The output of the linear system (6.5) two time steps into the future is easily determined because the dynamics are assumed to be entirely known.

$$y_{k+2} = C\Phi\Phi x_k + C\Phi\Gamma u_k + C\Gamma u_{k+1} \tag{6.22}$$

$$e_{k+2} = y^r_{k+2} - y_{k+2} \tag{6.23}$$

The simultaneous solution of (6.20b) and (6.21b) yields a solution for $u_k$; an expression for $u_{k+1}$, calculated at time $k$, is also available. However, to parallel the learning control process, this control will be recalculated at the next time step. To illustrate the similarity of the linear and learning control laws, as well as to express the laws in a compact form, several substitutions are defined.


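$$A = C\Gamma \tag{6.24a}$$

$$B = C\Phi\Gamma \tag{6.24b}$$

$$\Lambda = A^T Q_2 A + R_2 \tag{6.25a}$$

$$\Upsilon = B^T Q_2 A \tag{6.25b}$$

$$\Theta = A^T Q_1 \tag{6.25c}$$

$$\Xi = \left(B^T - \Upsilon\Lambda^{-1}A^T\right) Q_2 \tag{6.25d}$$

$$\begin{aligned}
u_k = {} & \left[A^T Q_1 A + B^T Q_2 B + R_1 - \Upsilon\Lambda^{-1}\Upsilon^T\right]^{-1} \Big[\left(\Theta C^r\Phi^r + \Xi C^r\Phi^r\Phi^r\right) x^r_k \\
& + \left(\Theta C^r\Gamma^r + \Xi\left(C^r\Phi^r\Gamma^r + C^r\Gamma^r\right)\right) r_k \\
& - \left(\Theta C\Phi + \Xi C\Phi\Phi\right) x_k \Big]
\end{aligned} \tag{6.26}$$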
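The substitutions (6.24) and (6.25) map directly onto a few matrix products. The following sketch assumes NumPy, assumes the weighting matrices are indexed by stage as $Q_1, Q_2, R_1, R_2$, and takes the reference outputs one and two steps ahead as inputs (for example from (6.1c) and (6.1d) with $r_{k+1} \approx r_k$); the function and argument names are hypothetical.

```python
import numpy as np

def two_step_linear(Phi, Gam, C, Q1, Q2, R1, R2, x, y_ref1, y_ref2):
    """Control u_k that is optimal over two steps for the linear plant (6.5)."""
    A = C @ Gam                                   # (6.24a)
    B = C @ Phi @ Gam                             # (6.24b)
    Lam = A.T @ Q2 @ A + R2                       # (6.25a)
    Ups = B.T @ Q2 @ A                            # (6.25b)
    Theta = A.T @ Q1                              # (6.25c)
    Xi = (B.T - Ups @ np.linalg.solve(Lam, A.T)) @ Q2   # (6.25d)
    H = A.T @ Q1 @ A + B.T @ Q2 @ B + R1 - Ups @ np.linalg.solve(Lam, Ups.T)
    d1 = y_ref1 - C @ Phi @ x                     # one-step error with zero control
    d2 = y_ref2 - C @ Phi @ Phi @ x               # two-step error with zero control
    return np.linalg.solve(H, Theta @ d1 + Xi @ d2)
```

Eliminating $u_{k+1}$ through (6.21b) is what produces the $\Upsilon\Lambda^{-1}\Upsilon^T$ correction; solving the two necessary conditions jointly for the stacked vector of $u_k$ and $u_{k+1}$ should return the same $u_k$.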

6.2.2  Learning  Control 

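For the nonlinear system, the output $y_{k+2}$ (6.27a) must be approximated by known quantities that are linear in $u_k$ and $u_{k+1}$. First, the nonlinear terms in (6.27a) are evaluated at the current time $k$ rather than at $k+1$ and an approximation is derived for the next state. Additionally, the learned dynamics are estimated by linearizing about the point $(x_k, u_{k-1})$.

$$y_{k+2} = C\Phi x_{k+1} + C\Gamma u_{k+1} + C f_{k+1}(x_{k+1}, u_{k+1}) + C\Psi_{k+1}(x_{k+1}) \tag{6.27a}$$

$$y_{k+2} \approx C\Phi x_{k+1} + C\Gamma u_{k+1} + C f_k(x_{k+1}, u_{k+1}) + C\Psi_k(x_{k+1}) \tag{6.27b}$$

$$x_{k+1} \approx \Phi x_k + \Gamma u_k + f_k(x_k, u_{k-1}) + \left.\frac{\partial f_k}{\partial u}\right|_{x_k, u_{k-1}} (u_k - u_{k-1}) + \Psi_k(x_k) \tag{6.28}$$

$$f_k(x_{k+1}, u_{k+1}) \approx f_k(x_k, u_{k-1}) + \left.\frac{\partial f_k}{\partial u}\right|_{x_k, u_{k-1}} (u_{k+1} - u_{k-1}) + \left.\frac{\partial f_k}{\partial x}\right|_{x_k, u_{k-1}} (x_{k+1} - x_k) \tag{6.29}$$

$$y_{k+2} \approx C\Phi x_{k+1} + C\Gamma u_{k+1} + C f_k(x_k, u_{k-1}) + C\left.\frac{\partial f_k}{\partial u}\right|_{x_k, u_{k-1}} (u_{k+1} - u_{k-1}) + C\left.\frac{\partial f_k}{\partial x}\right|_{x_k, u_{k-1}} (x_{k+1} - x_k) + C\Psi_k(x_{k+1}) \tag{6.30}$$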

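Using this approximation for $y_{k+2}$, the simultaneous solution of (6.20b) and (6.21b) yields an expression for $u_k$ in terms of (6.25) and (6.31). The variables $A$ and $B$ include both the linear components of the a priori model as well as learned state dependent corrections. $\partial f_k / \partial u$ is a correction to the constant $\Gamma$ matrix and $\partial f_k / \partial x$ is a correction to the constant $\Phi$ matrix.

$$A = C\Gamma + C\frac{\partial f_k}{\partial u} \tag{6.31a}$$

$$B = C\Phi\Gamma + C\Phi\frac{\partial f_k}{\partial u} + C\frac{\partial f_k}{\partial x}\Gamma + C\frac{\partial f_k}{\partial x}\frac{\partial f_k}{\partial u} \tag{6.31b}$$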


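Although the simultaneous solution of the first-order necessary conditions also yields an expression for $u_{k+1}$ at time $k$, $u_k$ is calculated on every time step. This control law resembles the form of (6.26) and introduces several terms associated with the nonlinear dynamics.

$$\begin{aligned}
u_k = {} & \left[A^T Q_1 A + B^T Q_2 B + R_1 - \Upsilon\Lambda^{-1}\Upsilon^T\right]^{-1} \Big[\left(\Theta C^r\Phi^r + \Xi C^r\Phi^r\Phi^r\right) x^r_k \\
& + \left(\Theta C^r\Gamma^r + \Xi\left(C^r\Phi^r\Gamma^r + C^r\Gamma^r\right)\right) r_k \\
& - \left(\Theta C\Phi + \Xi\left(C\Phi\Phi + C\frac{\partial f_k}{\partial x}\Phi - C\frac{\partial f_k}{\partial x}\right)\right) x_k \\
& + \left(\Theta C\frac{\partial f_k}{\partial u} + \Xi\left(C\Phi\frac{\partial f_k}{\partial u} + C\frac{\partial f_k}{\partial x}\frac{\partial f_k}{\partial u} + C\frac{\partial f_k}{\partial u}\right)\right) u_{k-1} \\
& - \left(\Theta C + \Xi\left(C\Phi + C + C\frac{\partial f_k}{\partial x}\right)\right) f_k(x_k, u_{k-1}) \\
& - \left(\Theta C + \Xi\left(C\Phi + C + C\frac{\partial f_k}{\partial x}\right)\right) \Psi_k(x_k) \Big]
\end{aligned} \tag{6.32}$$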


6.2.3  Multi-stage  Quadratic  Optimization 

The arguments presented in §6.1 and thus far in §6.2 may be generalized to derive a control law which is optimal with respect to a cost functional (6.33) that looks $n$ time steps into the future. The solution of this problem, however, will require assumptions about the propagation of the command input $r$ for $n$ future time steps. Additionally, the algebra required to write an explicit expression for $u_k$ becomes involved and the necessary approximations become less accurate.

$$J_k = \frac{1}{2}\sum_{i=1}^{n}\left[e_{k+i}^T Q_i\, e_{k+i} + u_{k+i-1}^T R_i\, u_{k+i-1}\right] \tag{6.33}$$

Chapter 6 - Indirect Learning Optimal Control


6.3  Implementation  and  Results 


6.3.1  Reference Model

The reference model, which generates trajectories that the plant states attempt to follow, exhibits a substantial influence on the closed-loop system performance. While a reference model that does not satisfy performance specifications will yield an unacceptable closed-loop system, a reference model that demands unrealistic (i.e. unachievable) state trajectories may introduce instability through control saturation. Therefore, the reference model must be selected to yield satisfactory performance given the limitations of the dynamics [28].

The reference model was selected to be the closed-loop system that resulted from applying the optimal control from a linear quadratic design to the aeroelastic oscillator dynamics linearized at the origin [29]. The discrete time representation of the reference model as well as the AEO linear dynamics are presented for $\Delta t = 0.1$.


^  lO 

0 

1. 

(6.34a) 

R=  1.0 

o 

1 

(6.346) 

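$$C = C^r = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \tag{6.35}$$

$$\Phi = \begin{bmatrix} 0.994798 & 0.106070 \\ -0.106070 & 1.122083 \end{bmatrix} \tag{6.36a}$$

$$\Gamma = \begin{bmatrix} 0.005202 \\ 0.106070 \end{bmatrix} \tag{6.36b}$$

$$\Phi^r = \begin{bmatrix} 0.905124 & 0.000286 \\ -0.905124 & -0.000286 \end{bmatrix} \tag{6.37a}$$

$$\Gamma^r = \begin{bmatrix} 0.094876 \\ 0.905124 \end{bmatrix} \tag{6.37b}$$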

Design of an optimal control law might be accomplished with a learning system that incrementally increases closed-loop performance requirements, by adjusting the reference trajectory in regions of the state space where the current control law can achieve near perfect tracking. This is a topic for future research.


6.3.2  Function Approximation


The discussion of direct learning optimal control (§3 - §5) focused on learning system architectures and algorithms which, themselves, operate as controllers. The discussion of indirect learning optimal control is primarily concerned with the manner in which experiential information about unmodeled dynamics may be incorporated into optimal control laws. The method by which a supervised learning system approximates the initially unmodeled dynamics is a separate issue which has received much investigation [21,30,31,32].

After a brief summary, this thesis abstracts the technique for realizing the nonlinear mapping $f(x, u)$ into a black box which provides the desired information: $f_k(x_k, u_{k-1})$, its partial derivatives, and $f_{k-1}(x_{k-1}, u_{k-1})$.

A spatially localized connectionist network is used to represent the mapping from the space of states and control to the initially unmodeled dynamics. The linear-Gaussian network achieves spatial locality by coupling a local basis function with an influence function [4,28]. The influence function, which determines the region in the input space of applicability of the associated basis function, is a hyper-Gaussian; the basis function is a hyperplane.

The contribution of a basis function to the network output equals the product of the basis function and the influence function, evaluated at the current input, where the influence function is normalized so that the sum of all the influence functions at the current input is unity [28]. The control law provides to the network an estimate of the model errors, $x_k - \Phi x_{k-1} - \Gamma u_{k-1}$. The supervised learning procedure follows an incremental gradient descent algorithm that reduces the network errors by adjusting the parameters that describe the slopes and offsets of the basis functions.

In terms of equations and for arbitrary input and output dimensions, $Y(x)$ is the network output evaluated at the current input $x$. The network consists of $n$ nodes (i.e. basis function and influence function pairs).

$$Y(x) = \sum_{i=1}^{n} L_i(x)\, \Gamma_i(x) \tag{6.38a}$$

$L_i(x)$ is the evaluation of the $i$-th basis function at the current input. $W_i$ is a weight matrix that defines the slopes of the hyperplane and $b_i$ is a bias vector. $x_0$ defines the center of the influence function.

$$L_i(x) = W_i (x - x_0) + b_i \tag{6.38b}$$

$\Gamma_i(x)$ is the normalized influence function and is not related to the discrete time $\Gamma$ matrix. $G_i(x)$ is the $i$-th influence function evaluated at $x$, where the diagonal matrix $D_i$ represents the spatial decays of the Gaussians.

$$\Gamma_i(x) = \frac{G_i(x)}{\sum_{j=1}^{n} G_j(x)} \tag{6.38c}$$

$$G_i(x) = \exp\left[-(x - x_0)^T D_i\, (x - x_0)\right] \tag{6.38d}$$
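A compact sketch of the network described by (6.38) may help make the roles of the basis and influence functions concrete. It assumes NumPy, assumes each node has its own center, and assumes the Gaussian exponent is the quadratic form in the diagonal decay matrix; the class name, method names, and learning rate are hypothetical, and only the slopes and offsets are adapted, as described in the text.

```python
import numpy as np

class LinearGaussianNetwork:
    """Hyperplane basis functions blended by normalized Gaussian influences."""

    def __init__(self, centers, decays, n_out):
        self.centers = np.asarray(centers, dtype=float)  # (n, d) node centers
        self.decays = np.asarray(decays, dtype=float)    # (n, d) diagonals of D_i
        n, d = self.centers.shape
        self.W = np.zeros((n, n_out, d))                 # hyperplane slopes
        self.b = np.zeros((n, n_out))                    # hyperplane offsets

    def influence(self, x):
        """Normalized influence functions; they sum to one at any input."""
        dx = x - self.centers
        g = np.exp(-np.sum(self.decays * dx ** 2, axis=1))
        return g / np.sum(g)

    def output(self, x):
        """Network output: influence-weighted sum of the local hyperplanes."""
        dx = x - self.centers
        L = np.einsum('nod,nd->no', self.W, dx) + self.b
        return self.influence(x) @ L

    def train_step(self, x, target, rate=0.1):
        """One gradient descent step on half the squared output error."""
        dx = x - self.centers
        gam = self.influence(x)
        err = target - self.output(x)
        self.b += rate * gam[:, None] * err[None, :]
        self.W += rate * gam[:, None, None] * err[None, :, None] * dx[:, None, :]
        return err
```

Because each update is scaled by the node's normalized influence, a training sample mainly changes the hyperplanes whose centers lie near the input, which is the spatial locality property referred to above.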

The learning network had three inputs (position, velocity, and control) and two outputs (unmodeled position and velocity dynamics). In addition, the partial derivatives of the system outputs with respect to the inputs were available.

Figure 6.1. The initially unmodeled velocity dynamics $g(x_2)$ as a function of velocity $x_2$.


The AEO dynamics are repeated in (6.39). The learning system must synthesize on-line the initial model uncertainty, which consists of the nonlinear term $g(x_2)$ in the velocity dynamics.


Xl 

X2 


0 


-1  nAiU-  2/1 


[xi]  ro' 

LX  3J  [1. 


FA 


0 


(6.39a) 


<7(3^2)  = 


iOOU 


—(lOOOxa)^  -f  ^(1000x2)®  -  ^(1.000x2)^1  (6.396) 

U  I 


The manner in which the position and control enter the dynamics is linear and perfectly modeled. Therefore, the function $f$ will be independent of the position and control (Figure 6.1). Additional model errors may be introduced to the a priori model by altering the coefficients that describe how the state and control enter the linear dynamics. The learning system will approximate all model uncertainty.

Although  the  magnitude  of  the  control  had  been  limited  in  the  direct  learning 
control  results,  where  the  reinforcement  signal  was  only  a  function  of  the  state 
error,  limitation  of  the  control  magnitude  was  not  necessary  for  indirect  learning 
controllers  because  control  directly  entered  the  cost  functional. 


6.3.3  Single-Stage Quadratic Optimization Results


For the minimization of the weighted sum of the squares of the current control and the succeeding output error, the performance of the learning enhanced control law (6.14) was compared to the associated linear control law (6.7), in the context of the aeroelastic oscillator. Results appear in Figures 6.2 through 6.9. The control and reference model were updated at 10 Hz; the AEO simulation was integrated using a fourth-order Runge-Kutta algorithm with a step size of 0.005.
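The two time scales in this setup (a 10 Hz control update and a 0.005 integration step) imply twenty integration steps per control period. The sketch below shows one way to arrange this, assuming NumPy and assuming the control is held constant between updates; the function names are hypothetical and `f` stands for any continuous time dynamics of the form `f(x, u)`, not specifically the AEO model.

```python
import numpy as np

def rk4_step(f, x, u, h):
    """One classical fourth-order Runge-Kutta step with the control held fixed."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * h * k1, u)
    k3 = f(x + 0.5 * h * k2, u)
    k4 = f(x + h * k3, u)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def propagate_one_control_period(f, x, u, dt_control=0.1, h=0.005):
    """Integrate the plant across one control period using fixed RK4 steps."""
    n_sub = int(round(dt_control / h))   # 20 sub-steps for the values quoted above
    for _ in range(n_sub):
        x = rk4_step(f, x, u, h)
    return x
```

A simulation loop would call the control law once per period, pass the resulting control to `propagate_one_control_period`, and record the state returned at the end of each period.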

Figure 6.2. Position and velocity time histories for the reference model as well as the AEO controlled by the linear and learning control laws, for the command $r = 1$ and the initial condition $x_0 = \{0, 0\}$.


Figure 6.2 illustrates the reference model position and velocity time histories as well as two sets of state time histories for the AEO controlled by the linear and learning control laws. The presence of unmodeled nonlinear dynamics prevented the linear control law from closely tracking the reference position. In contrast, the learning system closely followed the reference, after sufficient training. Both control laws maintained the velocity near the reference. Although the full learning control law (6.14) was implemented to produce these results, knowledge of the form of the AEO nonlinearities could have been used to eliminate the terms containing $\partial f_k / \partial u$. Figure 6.3 represents the errors between the AEO states and the reference model, for both control law designs. In a separate experiment that introduced model uncertainty in the linear dynamics, the linear control law (6.7) failed to track a command step input with zero steady-state error. The learning control law results looked similar to Figure 6.2.

Figure 6.3. The state errors for the AEO controlled by the linear and learning control laws.

Figure 6.4. The network outputs that were used to compute $u_k$ for the learning control law.

The specifics of the incremental function approximation were not a focus of this indirect learning control research. The learning process involved numerous trials (more than 1000) from random initial states within the range $\{-1.0, 1.0\}$; the commanded position was also selected randomly between the same limits. The allocation of network resources (i.e. adjustable parameters) and the selection of learning rates involved heuristics. Moreover, the learning performance depended strongly on these decisions. Automation of the network design process would have greatly facilitated this research.


The learning control law requires the values of the network outputs at the current state and the previous control, as well as the partial derivative of the network outputs with respect to the control, at the current state and the previous control. Additionally, the adaptive term $\Psi_k(x_k)$ requires the values of the network outputs at the previous state and the previous control. The network outputs, which appear in Figure 6.4, change most rapidly when the velocity is not near zero, i.e. at the beginning of a trial. Some rapid changes in the network outputs resulted from learning errors where $f$ did not accurately approximate the nonlinear dynamics. For the learning control law, the control as well as the terms of (6.14) that comprise the control appear in Figure 6.5. The control for the linear control law and the individual terms of (6.7) appear in Figure 6.6.

Figure 6.5. The control $u_k$ and the constituent terms of the learning control law (6.14).

Figure 6.6. The control $u_k$ and the constituent terms of the linear control law (6.7).

Figure 6.7. The estimated errors in the approximation of the initially unmodeled dynamics.

After substantial training, some errors remained in the network's approximation of the nonlinear dynamics. These errors are most noticeable in the estimation of the velocity dynamics at velocities not near zero. Figure 6.7 illustrates the initial model errors not represented by the function $f$; the adaptive term will reduce the effect of these remaining model errors. Experiments demonstrated that the system performed nearly as well when the adaptive contribution was removed from the control. A controller that augmented the linear law with only the adaptive correction was not evaluated.

Figure 6.8 shows the results of control laws (6.14) and (6.7) regulating the AEO from $x_0 = \{-1.0, 0.5\}$. The control magnitude was limited at 0.5 so that the results may be compared more easily to the benchmarks and the direct learning control results. Time is not explicitly shown in Figure 6.8; the state trajectory produced by the learning controller approached the origin much more quickly than the trajectory produced by the linear controller. The control objective remains to track a reference trajectory and, therefore, subtly differs from the goal of LQR (Figure 2.5). Recall that this reference model does not necessarily maximize system performance. Figure 6.9 shows the force histories which yielded the trajectories in Figure 6.8. The rapid switching in the learning control force results from learning errors and the sensitivity of the control law to the approximated Jacobian of $f_k$.

Figure 6.8. AEO regulation from $x_0 = \{-1.0, 0.5\}$ with control saturation at ±0.5.

Figure 6.9. Control history associated with Figure 6.8.

This indirect learning control technique was capable of learning, and therefore reducing the effect of, model uncertainty (linear and nonlinear). Therefore, the indirect learning controller derived from a linear model with model errors performed similarly to Figure 6.8 and outperformed the LQR solution which was derived from an inaccurate linear model (Figure 2.7). The indirect learning controller with limited control authority produced state trajectories similar to the results of the direct learning control experiments.

Summary


The aeroelastic oscillator demonstrated interesting nonlinear dynamics and served as an acceptable context in which to evaluate the capability of several direct and indirect learning controllers.

The ACP network was introduced to illustrate the biological origin of reinforcement learning techniques and to provide a foundation from which to develop the modified two-layer and single layer ACP architectures. The modified two-layer ACP introduced refinements that increased the architecture's applicability to the infinite horizon optimal control problem. However, results demonstrated that, for the defined plant and environment, this algorithm failed to synthesize an optimal control policy. Finally, the single layer ACP, which functionally resembled Q learning, successfully constructed an optimal control policy that regulated the aeroelastic oscillator.

Q learning approaches the direct learning paradigm from the mathematical theory of value iteration rather than from behavioral science. With sufficient training, the Q learning algorithm converged to a set of Q values that accurately described the expected discounted future return for each state-action pair. The optimal policy that was defined by these Q values successfully regulated the aeroelastic oscillator plant. The results of the direct learning algorithms (e.g. the ACP derivatives and Q learning) demonstrated the limitations of optimal control laws that are restricted to discrete controls and a quantized input space. The concept of extending Q learning to accommodate continuous inputs and controls was considered. However, the necessary maximization at each time step of a continuous, potentially multi-modal Q function may render impractical an on-line implementation of a continuous Q learning algorithm.

The optimal control laws for single-stage and two-step finite time horizon, quadratic cost functionals were derived for linear and nonlinear system models. The results of applying these control laws to cause the AEO to optimally track a linear reference model demonstrated that indirect learning control systems, which incorporate information about the unmodeled dynamics that is incrementally learned, outperform fixed parameter, linear control laws. Additionally, operating with continuous inputs and outputs, indirect learning control methods provide better performance than the direct learning methods previously mentioned. A spatially localized connectionist network was employed to construct the approximation of the initially unmodeled dynamics that is required for indirect learning control.

7.1 Conclusions

This thesis has collected several direct learning optimal control algorithms and
has also introduced a class of indirect learning optimal control laws. In the process
of investigating direct learning optimal controllers, the commonality between an
algorithm originating in behavioral science and another founded in mathematical
optimization helped unify the concept of direct learning optimal control. More
generally, this thesis has “drawn arrows” to illustrate how a variety of learning control
concepts are related. Several learning systems were applied as controllers for the
aeroelastic oscillator.

7.1.1  Direct  /  Indirect  Framework 

As a means of classifying approaches to learning optimal control laws, a
direct/indirect framework was introduced. Both direct and indirect classes of learning
controllers were shown to be capable of synthesizing optimal control laws, within
the restrictions of the particular method being used. Direct learning control implies
that the feedback loop that motivates the learning process is closed around system
performance. This approach is largely limited to discrete inputs and outputs. Indirect
learning control denotes a class of incremental control law synthesis methods for
which the learning loop is closed around the system model. The indirect learning
control laws derived in §6 are not capable of yielding stable closed-loop systems for
non-minimum phase plants.

As a consequence of closing the learning loop around system performance,
direct learning control procedures acquire information about control saturation.
Indirect learning control methods will learn the unmodeled dynamics as a function
of the applied control, but will not “see” control saturation which occurs external
to the control system.

7.1.2 Comparison of Reinforcement Learning Algorithms

The learning rules for the Adaptive Heuristic Critic (a modified TD(λ) procedure),
Q learning, and Drive-Reinforcement learning (the procedure used in the
ACP reinforcement centers) were compared. Each learning system was shown to
predict an expected discounted future reinforcement. Moreover, each learning rule
was shown to adjust the previous predictions in proportion to a prediction error that
was the difference between the current reinforcement and the difference between the
previous expected discounted future reinforcement and the discounted current
expected discounted future reinforcement. The constants of proportionality describe
the reduced importance of events that are separated by longer time intervals.
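
The shared prediction error described above can be written out in a few lines. The
sketch below is a single-step simplification and is not any one of the three learning
rules; the table `V`, the discount `gamma`, and the learning rate `rate` are
assumed names and values used only to make the arithmetic concrete.

```python
# Hypothetical single-step sketch of the common prediction error.
def prediction_error(r, v_prev, v_curr, gamma):
    """Current reinforcement minus (previous prediction - discounted current prediction)."""
    return r - (v_prev - gamma * v_curr)

def update_previous_prediction(V, x_prev, x_curr, r, gamma=0.9, rate=0.1):
    """Adjust the previous prediction in proportion to the prediction error."""
    error = prediction_error(r, V[x_prev], V[x_curr], gamma)
    V[x_prev] += rate * error
    return error

V = {"previous": 0.0, "current": 1.0}   # assumed predictions of discounted future reinforcement
error = update_previous_prediction(V, "previous", "current", r=0.5)
```

When the error is zero, the previous prediction already equals the current
reinforcement plus the discounted current prediction, and no adjustment is made.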

7.1.3  Limitations  of  Two-layer  ACP  Architectures 

The limitations of the two-layer ACP architectures arise primarily from the
simultaneous operation of two opposing reinforcement centers. The distinct positive
and negative reinforcement centers, which are present in the two-layer ACP,
incrementally improve estimates of the expected discounted future reward and cost,
respectively. The optimal policy is to select, for each state, the action that maximizes
the difference between the expected discounted future reward and cost. However,
the two-layer ACP network performs reciprocal inhibition between the two
reinforcement centers. Therefore, the information passed to the motor centers effects
the selection of a control action that either maximizes the estimate of expected
discounted future reward, or minimizes the estimate of expected discounted future
cost. In general, a two-layer ACP architecture will not learn the optimal policy.
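
A small numerical illustration shows why the three selection criteria need not agree.
The action names and the reward and cost estimates below are invented for this
sketch; they are not results from the ACP experiments.

```python
# Hypothetical estimates for one state; all numbers are illustrative only.
reward = {"a1": 5.0, "a2": 3.0, "a3": 1.0}   # expected discounted future reward
cost   = {"a1": 4.5, "a2": 1.0, "a3": 0.5}   # expected discounted future cost

# Optimal policy: maximize the difference between reward and cost.
best_net = max(reward, key=lambda a: reward[a] - cost[a])

# The two alternatives the text attributes to the two-layer architecture.
best_reward_only = max(reward, key=reward.get)
least_cost_only = min(cost, key=cost.get)
```

With these numbers the net criterion selects `a2`, while the reward-only and
cost-only criteria select `a1` and `a3`, so neither single-center rule recovers the
optimal choice.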

7.1.4 Discussion of Differential Dynamic Programming

For several reasons, differential dynamic programming (DDP) is an inappropriate
approach for solving the problem described in §1.1. First, the DDP algorithm
yields a control policy only in the vicinity of the nominally optimal trajectory.
Extension of the technique to construct a control law that is valid throughout the state
space is tractable only for linear systems and quadratic cost functionals. Second,
the DDP algorithm explicitly requires, as does dynamic programming, an accurate
model of the plant dynamics. Therefore, for plants with initially unknown dynamics,
a system identification procedure must be included. The coordination of the DDP
algorithm with a learning system that incrementally improves the system model
would constitute an indirect learning optimal controller. Third, since the quadratic
approximations are valid only in the vicinity of the nominal state and control
trajectories, the DDP algorithm may not extend to stochastic control problems for which
the process noise is significant. Fourth, similar to Newton's nonlinear programming
method, the original DDP algorithm will converge to a globally optimal solution
only if the initial state trajectory is sufficiently close to the optimal state trajectory.


7.2  Recommendations  for  Future  Research 

Several aspects of this research warrant additional thought. The extension
of direct learning methods to continuous inputs and continuous outputs might be
an ambitious endeavor. Millington [41] addressed this issue by using a spatially
localized connectionist / Analog Learning Element that defined, as a distributed
function of state, a continuous probability density function for control selection.
The learning procedure increased the probability of selecting a control that yielded,
with a high probability, a large positive reinforcement. The difficulty of generalizing
the Q learning algorithm to continuous inputs and outputs has previously been
discussed.

The focus of indirect learning control research should be towards methods of
incremental function approximation. The accuracy of the learned Jacobian of the
unmodeled dynamics critically impacts the performance of indirect learning optimal
control laws. The selection of network parameters (e.g. learning rates, the number
of nodes, and the influence function centers and spatial decay rates) determines how
successfully the network will map the initially unmodeled dynamics. The procedure
that was used for the selection of parameters was primarily heuristic. Automation of
this procedure could improve the learning performance and facilitate the control law
design process. Additionally, indirect learning optimal control methods should be
applied to problems with a higher dimension. The closed-loop system performance
as well as the difficulty of the control law design process should be compared with
a gain-scheduled linear approach to control law design.

Appendix A

Differential Dynamic Programming

A.1 Classical Dynamic Programming

Differential dynamic programming (DDP) shares many features with classical
dynamic programming (DP). For this reason, and because dynamic programming
is a more recognized algorithm, this chapter begins with a summary of the
dynamic programming algorithm. R. E. Bellman introduced the classical dynamic
programming technique, in 1957, as a method to determine the control function that
minimizes a performance criterion [33]. Dynamic programming, therefore, serves as
an alternative to the calculus of variations, and the associated two-point boundary
value problems, for determining optimal controls.

Starting from the set of state and time pairs which satisfy the terminal conditions,
the dynamic programming algorithm progresses backward in discrete time.
To accomplish the necessary minimizations, dynamic programming requires a
quantization of both the state and control spaces. At each discrete state, for every stage
in time, the optimal action is the action which yields the minimum cost to complete
the problem. Employing the principle of optimality, this completion cost from
a given discrete state, for a particular choice of action, equals the sum of the cost
associated with performing that action and the minimum cost to complete the problem
from the resulting state [23]. In (A.1), $J_t^*(x)$ equals the minimum cost to complete a
problem from state $x$ and discrete time $t$, $g(x, u, t)$ is the incremental cost function,
where $u$ is the control vector, and $T(x, u, t)$ is the state transition function.
Further, define a mapping from the state to the optimal controls, $S(x; t) = u(t)$,
where $u(t)$ is the argument that minimizes the right side of (A.1).

$$J_t^*(x(t)) = \min_{u(t)} \left[ g(x(t), u(t), t) + J_{t+1}^*\big(T(x(t), u(t), t)\big) \right] \qquad (A.1)$$

The principle of optimality substantially increases the efficiency of the dynamic
programming algorithm to construct $S(x; t)$ with respect to an exhaustive search,
and is described by Bellman and S. E. Dreyfus:

An  optimal  policy  has  the  property  that  whatever  the  initial 
state  and  initial  decision  are,  the  remaining  decisions  must 
constitute  an  optimal  policy  with  regard  to  the  state  resulting 
from  the  first  decision  [34]. 

The backward recursion process ends with the complete description of $S(x; t)$
for all states and for $t = N-1, N-2, \ldots, 1$, where $N$ is the final time. Given the
initial state $x^*(1) = x(1)$, (A.2) defines the forward DP recursion step.

$$u^*(t) = S(x^*(t); t) \qquad (A.2a)$$

$$x^*(t+1) = T(x^*(t), u^*(t), t) \qquad (A.2b)$$
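
The backward recursion (A.1) and the forward step (A.2) can be sketched on a small
quantized problem. Everything specific in this example is an assumption made for
illustration: the integer state grid, the three-level control set, the quadratic
incremental cost, the clamped transition function, and the terminal cost at the
final time are not taken from the thesis.

```python
# Hypothetical quantized problem; grid, costs and dynamics are illustrative only.
def dp_backward(states, controls, g, T, terminal_cost, N):
    """Backward recursion (A.1): returns the policy table S[(x, t)] and J_1*."""
    J = {x: terminal_cost(x) for x in states}      # cost to complete at final time N
    S = {}
    for t in range(N - 1, 0, -1):                  # t = N-1, N-2, ..., 1
        J_next, J = J, {}
        for x in states:
            u_best = min(controls, key=lambda u: g(x, u, t) + J_next[T(x, u, t)])
            S[(x, t)] = u_best
            J[x] = g(x, u_best, t) + J_next[T(x, u_best, t)]
    return S, J

def dp_forward(S, T, x1, N):
    """Forward step (A.2): look up the control, then apply the transition."""
    xs, us = [x1], []
    for t in range(1, N):
        us.append(S[(xs[-1], t)])
        xs.append(T(xs[-1], us[-1], t))
    return xs, us

states = range(-3, 4)                              # quantized state grid
controls = (-1, 0, 1)                              # quantized control set
g = lambda x, u, t: x * x + u * u                  # assumed incremental cost
T = lambda x, u, t: max(-3, min(3, x + u))         # assumed transition, clamped to the grid

S, J1 = dp_backward(states, controls, g, T, terminal_cost=lambda x: x * x, N=4)
xs, us = dp_forward(S, T, x1=3, N=4)
```

The policy table `S` holds one entry per quantized state and stage, which is the
lookup table storage discussed in the following paragraph.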
Although dynamic programming provides a general approach to optimal control
problems, including the optimal control of nonlinear systems with state and
control constraints, the DP algorithm requires substantial data storage and a large
number of minimizations. The substantial data storage that dynamic programming
requires results from the inefficient lookup table representation of $J^*$ and
$u^*$ at each quantized state and time; each item of data is represented exactly by
a unique adjustable parameter. This curse of dimensionality also existed in the
direct learning algorithms. Many practical problems, having fine levels of state and
control quantization, require a continuous functional mapping, for which a single
adjustable parameter encodes information over some region of the input space.
Additionally, a continuous mapping eliminates the necessity to interpolate between
discrete grid points in the input space to determine the appropriate control action
for an arbitrary input. A learning system could be employed to perform this
function approximation. A second disadvantage of the DP algorithm is the necessity
of an accurate dynamic model. If the equations of motion are not accurately
known a priori, explicit system identification is necessary to apply any dynamic
programming procedure. The coordination of the DP algorithm with a learning
system that incrementally improves the system model would constitute an indirect
learning optimal controller.

A.2 Differential Dynamic Programming


Discrete time differential dynamic programming, introduced by D. Q. Mayne
[35] and refined by D. H. Jacobson and Mayne [36], is a numeric approximation to
the classical dynamic programming algorithm and is, therefore, also applicable to
nonlinear discrete time optimal control problems.¹ Starting with a nominal state
trajectory and a nominal control sequence, the DDP algorithm selects neighboring
trajectories and sequences that yield the optimal decrease in the second-order
approximation to the cost functional $J(u) = \sum_{t=1}^{N} g(x(t), u(t), t)$.

The differential dynamic programming class of algorithms incorporates features
of both dynamic programming and the calculus of variations. Before presenting
an overview of the basic DDP algorithm, several of these properties will
be reviewed. DDP does not involve the discretization of state and control spaces,
which dynamic programming requires. Additionally, whereas dynamic programming
constructs the value function of expected future cost to achieve the terminal
conditions for each discrete state and each stage in discrete time, DDP constructs
a continuous quadratic approximation to the value function for all states near the
nominal trajectory. Finally, DDP solves for a control sequence iteratively, as do
many solution techniques for the two-point boundary-value problems which arise
from the calculus of variations. Bellman's algorithm (DP), in contrast, generates a
control policy in a single computationally intensive procedure.

Each iteration of the differential dynamic programming algorithm consists of
two phases: a backward run to determine $\delta u(x; t)$, the linear policy which defines
the change from the nominal control as a function of state, for states near the nominal,
and a forward run to update the nominal state trajectory and nominal control
sequence [37,38]. The DDP algorithm requires accurate models of the incremental
cost function $g(x, u, t)$ and the state transition function $T(x, u, t)$. Furthermore,
the original DDP algorithm requires that both of these functions are twice differentiable
with respect to states and controls; this condition is relaxed to a necessary
single differentiability in several modified DDP algorithms.

¹ Jacobson and Mayne also applied the differential dynamic programming method to
continuous time systems [36].

The following development of the original DDP algorithm follows primarily
from Yakowitz [39]. The nominal control sequence $u_n$ along with the initial state
$x(1)$ defines a nominal state trajectory $x_n$.

$$u_n = (u_n(1), u_n(2), \ldots, u_n(N)) \qquad (A.3a)$$

$$x_n = (x_n(1), x_n(2), \ldots, x_n(N)) \qquad (A.3b)$$

The backward recursion commences at the final decision time, $N$, by constructing
a quadratic approximation to the nominal cost.

$$L(x, u, N) = QP[g(x, u, N)] \qquad (A.4)$$

The $QP[\,\cdot\,]$ operator selects the quadratic and linear, but not the constant, terms of
the Taylor's series expansion of the argument about the nominal state and control
sequences. A first order necessary condition for a control $u^*$ to minimize $L(x, u, N)$
appears in (A.5), which can be solved for the optimal input.

$$\nabla_u L(x, u, N) = 0 \qquad (A.5)$$

$$u(x, N) = u_n(N) + \alpha_N + \beta_N \, \delta x(N) \qquad (A.6a)$$

$$\delta x(N) = x(N) - x_n(N) \qquad (A.6b)$$

The optimal value function, $f(x, N) = \min_u [g(x, u, N)]$, is also approximated by a
quadratic.

$$V(x; N) = L(x, u(x, N), N) \qquad (A.7)$$

The DDP backward calculations proceed for $t = N-1, N-2, \ldots, 1$. Assuming that
$V(x; t+1)$ has been determined, the cost attributed to the current stage together
with the optimal subsequent cost to achieve the terminal conditions is represented
by $L(x, u, t)$.

$$L(x, u, t) = QP[g(x, u, t) + V(T(x, u, t); t+1)] \qquad (A.8)$$

The necessary condition $\nabla_u L(x, u, t) = 0$ yields the policy for the incremental
control.

$$\delta u(x, t) = \alpha_t + \beta_t \,(x(t) - x_n(t)) \qquad (A.9)$$

$$u(x, t) = u_n(t) + \delta u(x, t) \qquad (A.10)$$

The expression for the variation in control (A.9) is valid for any state $x(t)$
sufficiently close to the nominal state $x_n(t)$. The vector and matrix $\alpha_t$ and $\beta_t$,
$1 \le t \le N$, must be maintained for use in the forward stage of DDP. The optimal
value function appears in (A.11).

$$V(x; t) = L(x, u(x, t), t) \qquad (A.11)$$

The forward run calculates a successor control sequence and the corresponding
state trajectory. Given $x(1)$, $u'(1) = u_n(1) + \alpha_1$ by (A.9) and (A.10). Therefore,
$x'(2) = T(x(1), u'(1), 1)$. For $t = 2, 3, \ldots, N$, (A.12) defines the new control and
state sequences which become the nominal values for the next iteration.

$$u'(t) = \delta u(x'(t), t) + u_n(t) \qquad (A.12a)$$

$$x'(t+1) = T(x'(t), u'(t), t) \qquad (A.12b)$$
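
The backward and forward runs can be sketched for a scalar problem. To keep the
derivatives explicit, this sketch assumes linear dynamics $x(t+1) = a\,x(t) + b\,u(t)$
and the quadratic incremental cost $g = q\,x^2 + r\,u^2$; the names `a`, `b`, `q`, `r`,
`Vx`, `Vxx` and the `Q..` terms are assumptions of this example, not notation from
Yakowitz [39]. For this special case the quadratic approximations are exact, so a
second iteration should leave the control sequence unchanged.

```python
# Hypothetical scalar linear-quadratic instance of one DDP iteration.
def rollout(x1, u, a, b):
    """Nominal state trajectory x(1..N) produced by the control sequence u(1..N)."""
    x = [x1]
    for uk in u[:-1]:
        x.append(a * x[-1] + b * uk)
    return x

def total_cost(x, u, q, r):
    return sum(q * xk ** 2 + r * uk ** 2 for xk, uk in zip(x, u))

def ddp_iteration(x1, u_n, a, b, q, r):
    N = len(u_n)
    x_n = rollout(x1, u_n, a, b)
    alpha, beta = [0.0] * N, [0.0] * N
    Vx, Vxx = 0.0, 0.0                      # no cost beyond the final stage N
    for k in reversed(range(N)):            # backward run: linear and quadratic terms only
        Qx, Qu = 2 * q * x_n[k] + a * Vx, 2 * r * u_n[k] + b * Vx
        Qxx, Quu, Qux = 2 * q + a * a * Vxx, 2 * r + b * b * Vxx, a * b * Vxx
        alpha[k], beta[k] = -Qu / Quu, -Qux / Quu       # delta u = alpha + beta * delta x
        Vx, Vxx = Qx - Qux * Qu / Quu, Qxx - Qux * Qux / Quu
    x, u = [x1], []
    for k in range(N):                      # forward run, as in (A.12)
        u.append(u_n[k] + alpha[k] + beta[k] * (x[k] - x_n[k]))
        if k < N - 1:
            x.append(a * x[k] + b * u[k])
    return x, u, alpha

params = dict(a=1.1, b=0.5, q=1.0, r=0.1)   # assumed plant and cost weights
u0 = [0.0] * 5                              # nominal control sequence
cost0 = total_cost(rollout(1.0, u0, 1.1, 0.5), u0, 1.0, 0.1)
x1_traj, u1, alpha1 = ddp_iteration(1.0, u0, **params)
cost1 = total_cost(x1_traj, u1, 1.0, 0.1)
_, u2, alpha2 = ddp_iteration(1.0, u1, **params)
```

For a nonlinear plant the same structure applies, but the derivative terms must be
re-evaluated along each new nominal trajectory and several iterations are needed.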

The reduction of required computations, which differential dynamic programming
demonstrates with respect to conventional mathematical programming algorithms,
is most noticeable for problems with many state and control variables and
many stages in discrete time. Whereas each iteration of the DDP algorithm involves
solving a low dimensional problem for each stage in time, mathematical programming
schemes for the numerical determination of an optimal trajectory typically
require the solution of a single high dimensional problem for each iteration. To
quantify this relationship, consider the problem where the state vector is a member
of $R^n$, the control vector lies in $R^m$, and $N$ represents the number of stages in
discrete time. The DDP algorithm inverts $N$ matrices of order $m$ for each iteration;
the computational effort, therefore, grows linearly with $N$.² The method
of variation of extremals provides a numeric solution to two point boundary-value
problems [23]. A single iteration of Newton's method for determining the roots
of nonlinear equations, a technique for implementing the variation of extremals
algorithm, in contrast, requires a matrix of order $N \cdot m$ to be inverted; the cost of an
iteration, therefore, grows in proportion to $N^3$ [40]. Furthermore, both algorithms
are quadratically convergent. In the case where $N = 1$, however, the DDP algorithm
and Newton's method define identical incremental improvements in state and
control sequences [39]. Similar computational differences exist between the DDP
algorithm and other iterative numerical techniques such as the method of steepest
descent and quasilinearization [23].

² The control sequence will be in $R^{N \cdot m}$.
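
The scaling argument can be made concrete with a rough operation count. The sketch
assumes that inverting a matrix of order $k$ costs on the order of $k^3$ operations,
which is a common rule of thumb and not a figure taken from [40]; the values of
`N` and `m` are arbitrary.

```python
# Hypothetical operation counts under an assumed cubic inversion cost.
def inversion_work(order, count=1, exponent=3):
    """Approximate work to invert `count` matrices of the given order."""
    return count * order ** exponent

N, m = 10, 2                                  # stages and control dimension (illustrative)
ddp_work = inversion_work(m, count=N)         # N matrices of order m
newton_work = inversion_work(N * m)           # one matrix of order N*m
ratio = newton_work // ddp_work
```

Under this assumption the ratio of the two counts equals $N^2$, so the advantage of
the stage-by-stage computation grows quickly with the number of stages.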

Appendix B

An Analysis of the AEO Open-loop Dynamics


This analysis follows directly from Parkinson and Smith [9]. Equation (2.12)
may be written in the form of an ordinary differential equation with small nonlinear
damping.

$$\frac{d^2 X}{d\tau^2} + X = \mu f\!\left(\frac{dX}{d\tau}\right) \quad \text{where } \mu = nA_1 \ll 1 \qquad (B.1)$$

If $\mu = 0$, the solution is a harmonic oscillator with a constant maximum vibration
amplitude $\bar{X}$ and phase $\phi$.

$$X = \bar{X} \cos(\tau + \phi) \qquad (B.2a)$$

$$\frac{dX}{d\tau} = -\bar{X} \sin(\tau + \phi) \qquad (B.2b)$$

If $\mu$ is non-zero but much less than one ($0 < \mu \ll 1$) the solution may be expressed
by a series expansion in powers of $\mu$.

$$X = \bar{X} \cos(\tau + \phi) + \mu g_1(\bar{X}, \tau, \phi) + \mu^2 g_2(\bar{X}, \tau, \phi) + \ldots \qquad (B.3)$$

* All quantities in this analysis are nondimensional.
In the expansion, $\bar{X}$ and $\phi$ are slowly varying functions of $\tau$. To first order, this
series may be approximated by (B.2), where $\bar{X}$ and $\phi$ are now functions of $\tau$. For
slowly varying $\bar{X}$ and $\phi$, these equations predict nearly circular trajectories in the
phase plane. The parameters presented in §2.2 and used for all AEO experiments
do not strictly satisfy these assumptions. However, the analysis provides insights
to the AEO dynamics.

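Following the outline presented in [9], each side of (B.1) is multiplied by $\dot{X}$
and the algebra is manipulated, where a dot denotes differentiation with respect to $\tau$.

$$(\ddot{X} + X)\dot{X} = \mu \dot{X} f(\dot{X}) \qquad (B.4)$$

$$(\ddot{X} + X)\dot{X} = \frac{1}{2} \frac{d}{d\tau}\left(\dot{X}^2 + X^2\right) \qquad (B.5)$$

$$X^2 + \dot{X}^2 = \bar{X}^2 \cos^2(\tau + \phi) + \bar{X}^2 \sin^2(\tau + \phi) = \bar{X}^2 \qquad (B.6)$$

$$\mu \dot{X} f(\dot{X}) = -\mu \bar{X} \sin(\tau + \phi)\, f\big(-\bar{X} \sin(\tau + \phi)\big) \qquad (B.7)$$

$$\frac{1}{2} \frac{d\bar{X}^2}{d\tau} = -\mu \bar{X} \sin(\tau + \phi)\, f\big(-\bar{X} \sin(\tau + \phi)\big) \qquad (B.8)$$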

That $\bar{X}$ varies slowly with $\tau$ implies that the cycle period is small compared with
the time intervals during which appreciable changes in the amplitude occur. Therefore,
an average of the behavior over a single period eliminates the harmonics and
is still sufficient for the purpose of examining the time evolution of the amplitude.
Recall from (2.12) and (B.1) that $f(dX/d\tau)$ is given by (B.10).

[Equation (B.10) is illegible in the source.]

Therefore, (B.9) reduces to

$$\frac{d\bar{X}^2}{d\tau} = a\bar{X}^2 - b\bar{X}^4 + c\bar{X}^6 - d\bar{X}^8 \qquad (B.11)$$

In the following analysis, let $R$ represent the square of the amplitude of the
reduced vibration, i.e. $R = \bar{X}^2$. Equation (B.11) may immediately be rewritten in
terms of $R$.

$$\frac{dR}{d\tau} = aR - bR^2 + cR^3 - dR^4 \qquad (B.12)$$

Recalling that $\mu \ll 1$, stationary oscillations are nearly circular and correspond to
constant values of $\bar{X}^2$; constant values of $\bar{X}^2$ are achieved when

$$\frac{dR}{d\tau} = 0. \qquad (B.13)$$


This condition is satisfied by $R = 0$ and also by the real, positive roots of the cubic
$a - bR + cR^2 - dR^3 = 0$. Negative or complex values for the squared amplitude of
vibration would not represent physical phenomena.

Stability is determined by the tendency of the oscillator to converge or diverge
in response to a small displacement $\delta R$. The sign of $\frac{d}{dR}\left(\frac{dR}{d\tau}\right)$ determines the
stability of the focus and the limit cycles and will be positive, negative, or zero for
unstable, stable, and neutrally stable trajectories, respectively.

$$\frac{d}{dR}\left(\frac{dR}{d\tau}\right) = a - 2bR + 3cR^2 - 4dR^3 \qquad (B.14)$$


The stability of the focus is easily analyzed.

$$\left.\frac{d}{dR}\left(\frac{dR}{d\tau}\right)\right|_{R=0} = a = nA_1 \ldots \qquad (B.15)$$

$$\ldots > 0 \qquad (B.16)$$

Given that $n$, $U$, $A_1$, $A_3$, $A_5$, and $A_7$ are positive, the coefficients $b$, $c$, and $d$
will also be positive. If the mechanical damping is zero, $a$ will be positive for all
values of windspeed. However, if the mechanical damping is greater than zero, then
$a > 0$ only if $U > U_c$. Therefore, with zero damping the focus is unstable for all
windspeeds greater than zero, and with positive damping the focus is unstable for
$U > U_c$. This minimum airspeed for oscillation is the definition of the reduced
critical windspeed; oscillation can be eliminated for windspeeds below a specific
value by sufficiently increasing the mechanical damping.

Three distinct solutions exist when $a > 0$; the focus is unstable for each. The
choice among these possibilities, which are characterized by the real positive roots
of the cubic $a - bR + cR^2 - dR^3 = 0$, depends upon the windspeed. (1) If $R_1$ is the
single real, positive root, there is a single stable limit cycle of radius $\sqrt{R_1}$ around
the unstable focus. This condition exists for two ranges of the reduced windspeed.
(2) Three real, distinct, positive roots, $R_3 > R_2 > R_1$, correspond to two stable
limit cycles at $\sqrt{R_1}$ and $\sqrt{R_3}$ separated by an unstable limit cycle at $\sqrt{R_2}$. (3)
$R_1$ and $R_2$ coalesce to a double, real, positive root while $R_3$ is a distinct, real,
positive root. The magnitude of the radius of the single stable limit cycle depends
on prior state information; this hysteresis is discussed below. This condition occurs
at two values of the reduced incident windspeed.

Figure B.1. The steady state amplitudes of oscillation versus
the incident windspeed $U$.
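
The second case can be checked numerically by locating the positive roots of the
cubic and evaluating (B.14) at each one. The coefficients below are invented so that
the roots fall at $R = 1, 2, 3$; they are not the AEO values, and the grid-and-bisection
root search is simply a convenient choice for this sketch (it does not detect the
double root of case 3).

```python
import math

# Hypothetical coefficients: a - b R + c R^2 - d R^3 = -(R - 1)(R - 2)(R - 3).
A, B, C, D = 6.0, 11.0, 6.0, 1.0

def rate_slope(R, a, b, c, d):
    """d/dR (dR/dtau) from (B.14); negative means a stable stationary amplitude."""
    return a - 2 * b * R + 3 * c * R ** 2 - 4 * d * R ** 3

def positive_roots(a, b, c, d, r_max=4.0, steps=37):
    """Bracket sign changes of the cubic on [0, r_max] and refine each by bisection."""
    cubic = lambda R: a - b * R + c * R ** 2 - d * R ** 3
    grid = [r_max * k / steps for k in range(steps + 1)]
    roots = []
    for lo, hi in zip(grid, grid[1:]):
        if cubic(lo) * cubic(hi) < 0:
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                if cubic(lo) * cubic(mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
    return roots

roots = positive_roots(A, B, C, D)
radii = [math.sqrt(R) for R in roots]                    # limit cycle radii
kinds = ["stable" if rate_slope(R, A, B, C, D) < 0 else "unstable" for R in roots]
focus_unstable = rate_slope(0.0, A, B, C, D) > 0         # equals a > 0
```

For these coefficients the classification alternates stable, unstable, stable, which
matches the two stable limit cycles separated by an unstable one.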

The most interesting dynamics occur when the second of these situations exists.
Figure B.1 plots the steady state amplitude of oscillation $\bar{X}_{ss}$, for circular
limit cycles, as a function of incident windspeed.

Figure B.2. $dR/d\tau$ versus $R$ for $U = 2766.5$. (Horizontal axis: squared oscillation
amplitude, $R$.)

A hysteresis in $\bar{X}_{ss}$ can be demonstrated by increasing the reduced airspeed
from $U < U_c$, where $\bar{X}_{ss} = 0$. For $U_c < U < U_2$, the amplitude of the steady
state oscillation will correspond to the inner stable limit cycle; for $U > U_2$, $\bar{X}_{ss}$
jumps to the larger stable limit cycle. As the dimensionless windspeed is decreased
from $U > U_2$, the amplitude of the steady state oscillation will remain on the
outer stable limit cycle while $U > U_1$. When the windspeed is decreased below
$U = U_1$, the steady state amplitude of oscillation decreases to the inner stable limit
cycle. Therefore, for a constant windspeed $U_1 < U < U_2$, $\bar{X}_{ss}$ resides on the inner
stable limit cycle when the initial displacement is less than the magnitude of the
unstable limit cycle, and $\bar{X}_{ss}$ lies on the outer stable limit cycle when the initial
displacement is greater than the magnitude of the unstable limit cycle.
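
The dependence on the initial displacement can be illustrated by integrating (B.12)
from two starting values of $R$ on either side of the unstable limit cycle. The
coefficients are hypothetical (chosen so that the stationary values are $R = 1, 2, 3$),
and the forward Euler step with the step size and duration shown is an assumption
of this sketch rather than the method used for the AEO experiments.

```python
# Hypothetical coefficients and a simple forward Euler integration of (B.12).
def settle(R0, a=6.0, b=11.0, c=6.0, d=1.0, dt=0.01, steps=5000):
    """Integrate dR/dtau = aR - bR^2 + cR^3 - dR^4 and return the final R."""
    R = R0
    for _ in range(steps):
        R += dt * (a * R - b * R ** 2 + c * R ** 3 - d * R ** 4)
    return R

inner = settle(1.5)   # starts below the unstable value R = 2
outer = settle(2.5)   # starts above the unstable value R = 2
```

With the windspeed, and therefore the coefficients, held fixed, the two runs settle
on different stable amplitudes, which is the behavior described for $U_1 < U < U_2$.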

For a specific value of the reduced wind velocity, the rate of change of the
square of the oscillation amplitude, $dR/d\tau$, can be plotted against the square of the
amplitude of oscillation $R$ (Figure B.2). If $dR/d\tau$ is positive, the oscillation amplitude
will increase with time, and if $dR/d\tau$ is negative the oscillation amplitude will decrease
with time. Therefore, an oscillation amplitude where the value of $dR/d\tau$ crosses from
positive to negative with increasing $R$ is a stable amplitude. The focus will be
stable when the time rate of change of oscillation amplitude is negative for $R$
slightly greater than zero.